The document discusses the importance of centralized event collection and analysis using a big data platform. It describes the challenges faced by MakeMyTrip in analyzing huge amounts of data from various sources. Centralized logging of structured event data from all systems and applications is recommended to enable effective log analysis, troubleshooting, and personalizing the user experience. A data service platform is needed to integrate data from different sources and power real-time and batch processing for analytics and insights.
It's a wrap - closing keynote for nlOUG Tech Experience 2017 (16th June, The ...) by Lucas Jellema
Closing keynote for the Tech Experience 2017 conference in Amersfoort, The Netherlands (16th June 2017). Touches upon the role of The Oracle Database in a changing landscape with NoSQL, CQRS, REST & JSON, Hadoop and Elastic Search. Discusses the gaps that Oracle professionals have to bridge in order to broaden their horizon and prepare for the (near) future. The session discusses the cloud - and how it will impact most organizations and Oracle specialists. It summarizes the main topics and themes from the Tech Experience 2017 conference.
Streaming Data Analytics with ksqlDB and Superset | Robert Stolz, Preset (hosted by Confluent)
Streaming data systems have been growing rapidly in importance in the modern data stack. ksqlDB provides a SQL interface to Kafka for analytic tools. Apache Superset, the most popular modern open-source visualization and analytics solution, plugs into nearly any data source that speaks SQL, including Kafka through ksqlDB. Here, we review and compare methods for connecting Kafka to Superset to enable streaming analytics use cases including anomaly detection, operational monitoring, and online data integration.
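To make the "SQL over Kafka" idea concrete, here is a hedged sketch (not from the talk) of issuing a query against ksqlDB's REST API, the kind of SQL interface a BI tool such as Superset can sit on top of. The host, stream name, and query are illustrative assumptions.

```python
# Hedged sketch: run a push query against a ksqlDB server over HTTP.
# Host, stream name, and query text are illustrative assumptions.
import requests

resp = requests.post(
    "http://localhost:8088/query",
    headers={"Accept": "application/vnd.ksql.v1+json"},
    json={
        "ksql": "SELECT user_id, COUNT(*) FROM pageviews "
                "GROUP BY user_id EMIT CHANGES;",
        "streamsProperties": {"ksql.streams.auto.offset.reset": "earliest"},
    },
    stream=True,  # a push query streams result rows back over HTTP
)
for line in resp.iter_lines():
    if line:
        print(line.decode("utf-8"))  # one chunk of the JSON result stream
```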
Building Complete Private Clouds with Apache CloudStack and Riak CS by John Burwell
IT infrastructure rigidity has emerged as one of the leading barriers to achieving the cost efficiency and operational agility required to drive growth. While public clouds such as Amazon Web Services and Rackspace provide this agility, many organizations are precluded from utilizing them due to regulatory, security, performance, and/or existing investment constraints. For these organizations to realize these agility benefits, they must transform their private infrastructures to embrace public cloud principles.
During this session, we will explore cloud system architecture principles and best practices. Using the Apache CloudStack cloud orchestration platform and Basho’s Riak CS object store, a complete, open source private cloud will be realized that creates the operational agility and cost-reduction benefits of public clouds.
Event Bus as Backbone for Decoupled Microservice Choreography (JFall 2017) by Lucas Jellema
Microservices are independent, encapsulated entities that produce meaningful results and business functionality in tentative collaboration. Events and pub/sub are great for allowing such decoupled interaction. Using Apache Kafka as a robust, distributed, real-time, high-volume event bus, this session demonstrates how microservices packaged with Docker and implemented in Java, Node, Python and SQL collaborate without knowing of each other. The microservices respond to social (media) events - courtesy of IFTTT - and publish results to multiple channels. The event bus operates across cloud services and on-premises platforms such as Kubernetes: both the bus and the microservices can run anywhere. The session also discusses a microservices platform with generic capabilities.
Outline: presentation summary
- intro microservices objectives, focus on decoupled collaboration
- demo four mservices in different technologies (Node, Java, ...); no direct dependencies; show the code (running on its own), show the packaging into a container, and the step of running the containers on a container management platform, using both Kubernetes and a Container Cloud Service (later on this will further the point of collaboration between widely separated microservices)
- discuss generic capabilities of a microservices platform (facilities required by many microservices that should themselves be available as microservices, such as cache, logging, authentication) and compare with a Java EE application server
- demo a microservice providing a generic cache functionality (based on MongoDB)
- outline the desired choreography (a four step workflow that requires participation from various microservices); briefly discuss routing slips and the Saga pattern
- discuss use of events and need of event bus
- intro Kafka
- demo pub and sub from each mservice to Kafka (a minimal sketch follows this outline)
- link IFTTT to Kafka (for demo: use ngrok to expose local Kafka to IFTTT cloud)
- demo end-to-end Social event=>IFTTT=>Kafka=>choreographed mservices=> final result
- demo: extend one of the microservices: change the code, package a new container image version and update the running version in the container platform; demonstrate that new workflows leverage the new version
- demo: move a microservice from on premises to the cloud - showing that the decoupled nature of the mservices means this move does not have any impact
- demo: show a change in the logic of the routing slip; none of the mservices require any change for a changed workflow choreography to be executed
- discuss cloud deployment of event bus + mservices
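As a rough illustration of the pub/sub step referenced in the outline, here is a minimal sketch using the kafka-python client. The broker address and the social-events topic name are assumptions for the example, not artifacts of the talk (which uses Java, Node, Python and SQL services).

```python
# Minimal Kafka pub/sub sketch (kafka-python); the broker address and
# topic name are illustrative assumptions.
import json
from kafka import KafkaProducer, KafkaConsumer

BROKER = "localhost:9092"   # assumed local broker
TOPIC = "social-events"     # hypothetical topic for IFTTT-sourced events

# Producer side: a microservice publishes a structured event.
producer = KafkaProducer(
    bootstrap_servers=BROKER,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send(TOPIC, {"source": "ifttt", "type": "tweet", "text": "hello"})
producer.flush()

# Consumer side: another microservice subscribes, with no knowledge
# of who produced the event.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKER,
    group_id="workflow-step-1",  # each service uses its own consumer group
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    print("received:", message.value)
    break  # demo: handle one event and stop
```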
Breaking the Monolith: Organizing Your Team to Embrace Microservices by Paul Osman
Microservices are becoming an increasingly popular way to build software systems. Thanks to evangelism from companies like Netflix, Amazon, Gilt, ThoughtWorks and SoundCloud, more organizations are considering whether or not they should adopt this practice.
In this talk, I’ll discuss our experiences evolving 500px from a single, monolithic Ruby on Rails application to a series of composable microservices written in Ruby and Go. I’ll talk about the challenges we faced from a business, engineering, QA and operations perspective and how moving to microservices encouraged (or required) change in our organizational structure and culture.
In this talk, you’ll learn how a change in how we develop software affected team structures, development environments, testing infrastructure and encouraged us to explore moving to cloud hosting and to move closer to continuous delivery. You’ll also learn about the pitfalls, both expected and unexpected that we experienced along the way.
By sharing some of our experiences, I hope to provide some guidance to engineering teams considering whether or not to adopt microservices.
Real time Messages at Scale with Apache Kafka and Couchbase by Will Gardella
Kafka is a scalable, distributed publish subscribe messaging system that's used as a data transmission backbone in many data intensive digital businesses. Couchbase Server is a scalable, flexible document database that's fast, agile, and elastic. Because they both appeal to the same type of customers, Couchbase and Kafka are often used together.
This presentation from a meetup in Mountain View describes Kafka's design and why people use it, Couchbase Server and its uses, and the use cases for both together. Also covered is a description and demo of Couchbase Server writing documents to a Kafka topic and consuming messages from a Kafka topic using the Couchbase Kafka Connector.
WSO2Con ASIA 2016: Building Apps Using WSO2 App Dev Platform by WSO2
Are you trying to build an application that is scalable, highly available, secure and within budget? Do you want to use a microservice architecture? Do you need to integrate several systems? Are you wondering how to build a beautiful dashboard or a cloud-based SaaS? Do you wish to expose industry-grade APIs? Look no further. The WSO2 App Dev platform can help you achieve these objectives while cutting down time to market. The platform is accompanied by a DevX-tuned set of tools to increase development productivity and exploit the full potential of its app dev capabilities. This session will discuss how you can build apps using this platform.
Keep your Metadata Repository Current with Event-Driven Updates using CDC and... by Confluent
The data science techniques and machine learning models that provide the greatest business value and insights require data that spans enterprise silos. To integrate this data, and ensure you’re joining on the right fields, you need a comprehensive, enterprise-wide metadata repository. More importantly, you need it to be always up to date. Nightly updates are simply not good enough when customers and users expect near-real-time responsiveness.
The challenge with keeping a metadata repository up to date lies not with cloud services or distributed storage frameworks, but rather with the relational database management systems (RDBMSs) that dot the enterprise landscape. At Comcast, we’ve found it relatively easy to feed our Apache Atlas metadata repo incrementally from Hadoop and AWS, using event-driven pushes to a dedicated Apache Kafka topic that Atlas listens to. Such pushes are not practical with RDBMSs, however, since the event-driven technique there is the database trigger. Triggers are so invasive and potentially detrimental to performance that your DB admin likely won’t allow one for detecting metadata changes.
Triggers are out. Pulling the complete current state of metadata from an RDBMS at regular intervals and calculating the deltas is too slow and unworkable. And it turns out that out-of-the-box log-based change data capture (CDC) is also a dead end, because metadata changes are represented in transaction logs as SQL DDL strings, not as atomic insert/update/delete operations as data changes are.
So, how do you keep your metadata repository always up to date with the current state of your RDBMS metadata? Our group solved this challenge by creating an alternate method for CDC on RDBMS metadata based on database system tables. Our query-based CDC serves as a Kafka Connect source for our Apache Atlas sink, providing event-driven, continuous updates to RDBMS metadata in our repository, but does not suffer from the usual limitations/disadvantages of vanilla query-based CDC. If you’re facing a similar challenge, join us at this session to learn more about the obstacles you’ll likely face and how you can overcome them using the method we implemented.
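As a rough companion to this abstract, here is a hedged sketch of the general idea of query-based CDC over system tables: poll a system catalog, diff against the previous snapshot, and publish the deltas to Kafka. It assumes PostgreSQL's information_schema, a hypothetical metadata-changes topic, and the psycopg2 and kafka-python libraries; it is not the Comcast implementation.

```python
# Hedged sketch of query-based CDC on RDBMS metadata: poll a system
# catalog, diff against the last snapshot, publish changes to Kafka.
# Catalog, DSN, topic, and interval are assumptions for illustration.
import json
import time
import psycopg2
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def snapshot(conn):
    """Read current column-level metadata from the system catalog."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT table_schema, table_name, column_name, data_type "
            "FROM information_schema.columns WHERE table_schema = 'public'"
        )
        return set(cur.fetchall())

conn = psycopg2.connect("dbname=appdb user=cdc_reader")  # hypothetical DSN
previous = snapshot(conn)
while True:
    time.sleep(30)  # polling interval: a tunable trade-off, not a spec
    current = snapshot(conn)
    for added in current - previous:
        producer.send("metadata-changes", {"op": "add", "column": added})
    for removed in previous - current:
        producer.send("metadata-changes", {"op": "drop", "column": removed})
    producer.flush()
    previous = current
```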
A Practical Guide to Selecting a Stream Processing Technology by Confluent
Presented by Michael Noll, Product Manager, Confluent.
Why are there so many stream processing frameworks that each define their own terminology? Are the components of each comparable? Why do you need to know about spouts or DStreams just to process a simple sequence of records? Depending on your application’s requirements, you may not need a full framework at all.
Processing and understanding your data to create business value is the ultimate goal of a stream data platform. In this talk we will survey the stream processing landscape, the dimensions along which to evaluate stream processing technologies, and how they integrate with Apache Kafka. Particularly, we will learn how Kafka Streams, the built-in stream processing engine of Apache Kafka, compares to other stream processing systems that require a separate processing infrastructure.
Introduction to Apache Kafka and Confluent... and why they matter by Confluent
Milano Apache Kafka Meetup by Confluent (First Italian Kafka Meetup) on Wednesday, November 29th 2017.
The talk introduces Apache Kafka (including the Kafka Connect and Kafka Streams APIs) and Confluent (the company founded by the creators of Kafka), and explains why Kafka is an excellent, simple solution for managing data streams in the context of two of the main driving forces and industry trends: the Internet of Things (IoT) and microservices.
OCCIware@POSS 2016 - an extensible, standard XaaS cloud consumer platform by Marc Dutoo
OCCIware at Paris Open Source Summit 2016 - an extensible, standard XaaS cloud consumer platform - demos : Docker & Linked Data Studios, online playground
Change data capture with MongoDB and Kafka by Dan Harvey
In any modern web platform you end up with a need to store different views of your data in many different datastores. I will cover how we have coped with doing this in a reliable way at State.com across a range of different languages, tools and datastores.
Mainframe Integration, Offloading and Replacement with Apache Kafka | Kai Wae... (hosted by Confluent)
Legacy migration is a journey. Mainframes cannot be replaced in a single project. A big bang will fail. This has to be planned long-term.
Mainframe offloading and replacement with Apache Kafka and its ecosystem can be used to keep a more modern data store in real-time sync with the mainframe, while at the same time persisting the event data on the bus to enable microservices, and deliver the data to other systems such as data warehouses and search indexes.
This session walks through the different steps some companies have already gone through. Technical options like Change Data Capture (CDC), MQ, and third-party tools for mainframe integration, offloading and replacement are explored.
My team at Zalando fell in love with KStreams and their programming model straight out of the gate. However, as a small team of developers, building out and supporting our infrastructure while still trying to deliver solutions for our business has not always resulted in a smooth journey. Can a small team of a couple of developers run their own Kafka infrastructure confidently and still spend most of their time developing code? In this talk, we will dive into some of the problems we experienced while running Kafka brokers and Kafka Streams applications, as well as the consultations we had with other teams around this matter. We will outline some of the pragmatic decisions we made regarding backups, monitoring and operations to minimize the time spent administering our Kafka brokers and various stream applications.
WSO2Con USA 2017: Scalable Real-time Complex Event Processing at Uber by WSO2
The Marketplace data team at Uber has built a scalable complex event processing platform to solve many challenging real-time data needs for various Uber products. This platform has been in production for more than a year and supports over 100 real-time data use cases with a team of 3. In this talk, we will share the details of the design, our experience, and how we employ Siddhi, Kafka and Samza at scale.
Scala eXchange: Building robust data pipelines in Scala by Alexander Dean
Over the past couple of years, Scala has become a go-to language for building data processing applications, as evidenced by the emerging ecosystem of frameworks and tools including LinkedIn's Kafka, Twitter's Scalding and our own Snowplow project (https://github.com/snowplow/snowplow).
In this talk, Alex will draw on his experiences at Snowplow to explore how to build rock-solid data pipelines in Scala, highlighting a range of techniques including:
* Translating the Unix stdin/out/err pattern to stream processing
* "Railway oriented" programming using the Scalaz Validation
* Validating data structures with JSON Schema
* Visualizing event stream processing errors in ElasticSearch
Alex's talk draws on his experiences working with event streams in Scala over the last two and a half years at Snowplow, and on his recent work writing Unified Log Processing, a Manning book.
URP? Excuse You! The Three Metrics You Have to Know by Confluent
(Todd Palino, LinkedIn) Kafka Summit SF 2018
What do you really know about how to monitor a Kafka cluster for problems? Is your most reliable monitoring your users telling you there’s something broken? Are you capturing more metrics than the actual data being produced? Sure, we all know how to monitor disk and network, but when it comes to the state of the brokers, many of us are still unsure of which metrics we should be watching, and what their patterns mean for the state of the cluster. Kafka has hundreds of measurements, from the high-level numbers that are often meaningless to the per-partition metrics that stack up by the thousands as our data grows.
We will thoroughly explore three key monitoring concepts in the broker that will leave you an expert in identifying problems with the least amount of pain:
- Under-replicated Partitions: the mother of all metrics
- Request Latencies: why your users complain
- Thread pool utilization: how could 80% be a problem?
We will also discuss the necessity of availability monitoring and how to use it to get a true picture of what your users see, before they come beating down your door!
Using Kafka as a Database For Real-Time Transaction Processing | Chad Preisle... (hosted by Confluent)
You have learned about Kafka event sourcing with streams and using Kafka as a database, but you may be having a tough time wrapping your head around what that means and what challenges you will face. Kafka's exactly-once semantics, data retention rules, and Streams DSL make it a great database for real-time transaction processing. This talk will focus on how to use Kafka events as a database. We will talk about using KTables vs. GlobalKTables, and how to apply them to patterns we use with traditional databases. We will go over a real-world example of joining events against existing data and some issues to be aware of. We will finish by covering some important things to remember about state stores, partitions, and streams to help you avoid problems when your data sets become large.
Metrics Are Not Enough: Monitoring Apache Kafka and Streaming Applications by Confluent
When you are running systems in production, clearly you want to make sure they are up and running at all times. But in a distributed system such as Apache Kafka… what does “up and running” even mean?
Experienced Apache Kafka users know what is important to monitor, which alerts are critical and how to respond to them. They don’t just collect metrics - they go the extra mile and use additional tools to validate availability and performance on both the Kafka cluster and their entire data pipelines.
In this presentation we’ll discuss best practices of monitoring Apache Kafka. We’ll look at which metrics are critical to alert on, which are useful in troubleshooting and what may actually be misleading. We’ll review a few “worst practices” - common mistakes that you should avoid. We’ll then look at what metrics don’t tell you - and how to cover those essential gaps.
When the Cloud is a Rockin: High Availability in Apache CloudStack by John Burwell
CloudStack currently provides a variety of bespoke high availability (HA) mechanisms for resources such as virtual machines, hosts, and virtual routers. Each of these implementations duplicates the HA check/recovery cycle, as well as the concurrency, persistence, and clustering required to manage high availability for any CloudStack resource. The High Availability Resource Management Service has been developed to consolidate these concerns, providing a robust, extensible HA mechanism. Using this service, plugins only need to define health check, activity check, and fence operations.
"In love with Open Source : Past, Present and Future" : Keynote OSDConf 2014Piyush Kumar
OPEN SOURCE DEVELOPERS CONFERENCE http://osdconf.in/
★ April 26-27th, Noida ★
Keynote Session By Piyush Kumar (Lead of Infrastructure and Website Operations at MakeMyTrip)
Slides from my talk at SiliconIndia OSS conference in Bangalore Nov 2012. Talks about use of OSS as a competitive advantage, especially in areas like eCommerce citing Flipkart as an example.
Celery is a really good framework for doing background task processing in Python (and other languages). While it is ridiculously easy to use Celery, doing complex task flows (task trees, graphs, dependencies, etc.) has been a challenge.
This talk introduces the audience to these challenges in Celery and also explains how they can be addressed programmatically and by using the latest features in Celery (3+); a minimal sketch follows.
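For instance, here is a small sketch of composing a non-flat task flow with Celery's canvas primitives (chain, group, chord). The broker/backend URLs and the task bodies are illustrative placeholders, not from the talk.

```python
# A small sketch of composing a non-trivial task flow with Celery's
# "canvas" primitives (chain/group/chord, available since Celery 3).
# Broker URL, backend URL, and task bodies are illustrative placeholders.
from celery import Celery, chain, chord, group

app = Celery("flows", broker="redis://localhost:6379/0",
             backend="redis://localhost:6379/1")  # chord needs a result backend

@app.task
def fetch(url):
    return "payload from " + url

@app.task
def parse(payload):
    return len(payload)

@app.task
def combine(sizes):
    return sum(sizes)

# chain: the result of each step feeds the next (fetch -> parse).
pipeline = chain(fetch.s("http://example.com"), parse.s())

# chord: run a group of tasks in parallel, then pass all their results
# to a callback, expressing a task graph rather than a flat job queue.
fanout = chord(group(parse.s("a" * n) for n in (10, 20, 30)), combine.s())

if __name__ == "__main__":
    pipeline.delay()       # requires a running broker and worker
    print(fanout.delay())  # AsyncResult for the combine() callback
```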
As businesses interact with partners, developers, and internal resources through new digital experiences, the number of sources of data expands. Enterprises strive to make sense of the new world of big and broad data as structured internal systems meet the ocean of unstructured and semi-structured data. Purpose-built for the new digital economy, Apigee Insights is a flexible broad data platform, supporting self-service and programmatic access.
The latest cloud-based distributed systems are very complicated configurations whose behavior spans many components. Customer-facing applications are part of the product, and service quality targets directly linked to business indicators are needed. A legacy monitoring system based on traditional system management is linked neither to business indicators nor to measuring service quality. Google advocates the idea of site reliability engineering (SRE) and introduces efforts to measure quality of service. Based on the concept of SRE, a service quality monitoring system collects and analyzes logs from various components, not only application code but all infrastructure components. Since very large amounts of data must be processed in real time, the system must be designed carefully with reference to big data architectures. Using such a system, you can measure the quality of service and continuously improve it.
Gain New Insights by Analyzing Machine Logs using Machine Data Analytics and BigInsights.
Half of Fortune 500 companies experience more than 80 hours of system downtime annually. Spread evenly over a year, that amounts to approximately 13 minutes every day. As a consumer, the thought of online bank operations being inaccessible so frequently is disturbing. As a business owner, when systems go down, all processes come to a stop. Work in progress is destroyed, and failure to meet SLAs and contractual obligations can result in expensive fees, adverse publicity, and loss of current and potential future customers. Ultimately, the inability to provide a reliable and stable system results in loss of $$$'s. While the failure of these systems is inevitable, the ability to predict failures in time and intercept them before they occur is now a requirement.
A possible solution to the problem can be found in the huge volumes of diagnostic big data generated at the hardware, firmware, middleware, application, storage and management layers indicating failures or errors. Machine analysis and understanding of this data is becoming an important part of debugging, performance analysis, root cause analysis and business analysis. In addition to preventing outages, machine data analysis can also provide insights for fraud detection, customer retention and other important use cases.
Impact 2013 2963 - IBM Business Process Manager Top Practices by Brian Petrini
Process efficiency remains the top priority of IT executives around the world. To help you succeed, IBM has collected a number of key top practices that have proven to be the necessary ingredient of any success story with the market-leading process management solution, IBM Business Process Manager. Placed in the context of an end-to-end BPM solution lifecycle, this session will focus on key infrastructure, administration, and operational top practices for IBM BPM Standard and Advanced, as distilled by lead IBM practitioners based on experiences with projects worldwide. By the end of the session you will have the top tips on setting up development environments, critical points on keeping your IBM BPM infrastructure scalable, performance tuning, as well as access to top intellectual capital in this area.
Machine Data Is EVERYWHERE: Use It for Testing by TechWell
As more applications are hosted on servers, they produce immense quantities of logging data. Quality engineers should verify that apps are producing log data that is existent, correct, consumable, and complete. Otherwise, apps in production are not easily monitored, have issues that are difficult to detect, and cannot be corrected quickly. Tom Chavez presents the four steps that quality engineers should include in every test plan for apps that produce log output or other machine data. First, test that the data is being created. Second, ensure that the entries are correctly formatted and complete. Third, make sure the data can be consumed by your company’s log analysis tools. And fourth, verify that the app will create all possible log entries from the test data that is supplied. Join Tom as he presents demos including free tools. Learn the steps you need to include in your test plans so your team’s apps not only function but also can be monitored and understood from their machine data when running in production.
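A hedged sketch of what the second of those steps (entries are correctly formatted and complete) can look like as an automated check; the required field set and the sample line are assumptions for illustration, not from the talk.

```python
# Sketch of validating log-entry format and completeness. The required
# fields and the sample entries are illustrative assumptions.
import json

REQUIRED_FIELDS = {"timestamp", "level", "service", "message"}

def validate_log_line(line):
    """A log entry passes if it is valid JSON and carries every
    required field with a non-empty value."""
    try:
        entry = json.loads(line)
    except ValueError:
        return False
    return all(entry.get(f) not in (None, "") for f in REQUIRED_FIELDS)

def test_app_emits_complete_entries():
    sample = ('{"timestamp": "2013-12-06T10:00:00Z", "level": "INFO", '
              '"service": "checkout", "message": "order placed"}')
    assert validate_log_line(sample)
    assert not validate_log_line("order placed with no structure")
```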
Operational systems manage our finances, shopping, devices and much more. Adding real-time analytics to these systems enables them to instantly respond to changing conditions and provide immediate, targeted feedback. This use of analytics is called "operational intelligence," and the need for it is widespread.
This talk will explain how in-memory computing techniques can be used to implement operational intelligence. It will show how an in-memory data grid integrated with a data-parallel compute engine can track events generated by a live system, analyze them in real time, and create alerts that help steer the system’s behavior. Code samples will demonstrate how an in-memory data grid employs object-oriented techniques to simplify the correlation and analysis of incoming events by maintaining an in-memory model of a live system.
The talk also will examine simplifications offered by this approach over directly analyzing incoming event streams from a live system using complex event processing or Storm. Lastly, it will explain key requirements of the in-memory computing platform for operational intelligence, in particular real-time updating of individual objects and high availability using data replication, and contrast these requirements to the design goals for stream processing in Spark.
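A toy sketch of the in-memory-model idea described above: keep one object per live entity, update it as events arrive, and raise alerts from state held in memory rather than re-scanning the raw event stream. The event shape and alert threshold are assumptions for illustration.

```python
# Rough sketch of correlating events against an in-memory model of a
# live system. Event shape and threshold are illustrative assumptions.
from collections import defaultdict

class DeviceModel:
    """In-memory state for one device, updated event by event."""
    def __init__(self):
        self.error_count = 0

    def apply(self, event):
        if event["kind"] == "error":
            self.error_count += 1
        elif event["kind"] == "ok":
            self.error_count = 0  # recovery resets the streak

    def alert(self):
        return self.error_count >= 3  # correlation lives in the object

devices = defaultdict(DeviceModel)

for event in [{"device": "d1", "kind": "error"}] * 3:
    model = devices[event["device"]]
    model.apply(event)
    if model.alert():
        print("steering alert for", event["device"])
```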
Building a Real-Time Security Application Using Log Data and Machine Learning... by Sri Ambati
Building a Real-Time Security Application Using Log Data and Machine Learning- Karthik Aaravabhoomi
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
GraphSummit - Process Tempo - Build Graph Applications by Neo4j
Neo4j offers a powerful platform for developing digital twins and advanced graph data science use cases. Process Tempo accelerates these efforts with a native Neo4j, no-code development environment that combines data visualization with advanced workflow. Learn how the combination of these features can open new value streams for your Neo4j graph investment.
Using AWS to design and build your data architecture has never been easier to gain insights and uncover new opportunities to scale and grow your business. Join this workshop to learn how you can gain insights at scale with the right big data applications.
WSO2's API Vision: Unifying Control, Empowering Developers by WSO2
Discover the innovative features and strategic vision that keep WSO2 an industry leader. Explore the exciting 2024 roadmap of WSO2 API management, showcasing innovations, unified APIM/APK control plane, natural language API interaction, and cloud native agility. Discover how open source solutions, microservices architecture, and cloud native technologies unlock seamless API management in today's dynamic landscapes. Leave with a clear blueprint to revolutionize your API journey and achieve industry success!
New usage model for real-time analytics by Dr. William L. Bain at Big Data S... (Big Data Spain)
Operational systems manage our finances, shopping, devices and much more. Adding real-time analytics to these systems enables them to instantly respond to changing conditions and provide immediate, targeted feedback. This use of analytics is called “operational intelligence,” and the need for it is widespread.
Independent of the source of the data, integrating event streams into an enterprise architecture is getting more and more important in the world of sensors, social media streams and the Internet of Things. Events have to be accepted quickly and reliably, and they have to be distributed and analyzed, often with many consumers or systems interested in all or part of the events. Storing such huge event streams in HDFS or a NoSQL datastore is feasible and no longer much of a challenge. But if you want to be able to react fast, with minimal latency, you cannot afford to first store the data and do the analysis/analytics later: you have to be able to include part of your analytics right where you consume the data streams. Products for event processing, such as Oracle Event Processing or Esper, have been available for quite a long time and used to be called Complex Event Processing (CEP). In the past few years, another family of products has appeared, mostly out of the big data technology space, called Stream Processing or Streaming Analytics: mostly open source products/frameworks such as Apache Storm, Spark Streaming, Flink and Kafka Streams, along with supporting infrastructure such as Apache Kafka. In this talk I will present the theoretical foundations of stream processing, discuss the core properties a stream processing platform should provide, and highlight the differences you might find between the more traditional CEP and the more modern stream processing solutions.
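As a toy illustration of analyzing a stream as it arrives rather than storing it first and querying later, here is a one-minute sliding-window count per key in plain Python, the kind of operator every stream processing platform provides natively. The window size and event shape are assumptions.

```python
# Toy sliding-window count: analyze events on arrival instead of
# store-then-query. Window size and event shape are assumptions.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
windows = defaultdict(deque)  # key -> timestamps inside the window

def observe(key, ts=None):
    ts = ts if ts is not None else time.time()
    q = windows[key]
    q.append(ts)
    while q and q[0] < ts - WINDOW_SECONDS:  # evict expired events
        q.popleft()
    return len(q)  # current count within the window

print(observe("sensor-1"))  # 1
print(observe("sensor-1"))  # 2
```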
Importance of ‘Centralized Event collection’ and BigData platform for Analysis !
1. Importance of ‘Centralized Event collection’
and BigData platform for Analysis !
DevOpsDays India, Bangalore - 2013
~/Piyush
Manager, Website Operations at MakeMyTrip
2. What to expect:
• MakeMyTrip data challenges!
• Event Data a.k.a. Logs & Log Analysis
• Why Centralized Logging … for systems and applications!
• Capturing Events: why structured data emitted from apps for machines is a better approach!
• Data Service Platform (DSP): why?
• Inputs: data for the DSP
• Top architecture considerations
• Top-level key tasks
• Tools arsenal, API management and Service Cloud
3. MakeMyTrip data challenges…!
• Multi-DC/colocation setup
• Different types of data sources: internal/external (structured, semi-structured, unstructured)
– Online Transaction Data Store
– ERP
– CRM
– Email Behavior / Survey results
– Web Analytics
– Logs: Web, Application, User Activity logs
– Social Media
– Inventory / Catalog
– Data residing in Excel files
– Monitoring Metric Data: Graphite (time-series Whisper), Splunk, ElasticSearch (Logstash)
– Many other different sources
• Storing and Analyzing Huge Event Data!
4. Some challenges…!
• Aggregate web usage data and transactional data to generate one view
• Process multiple GBs to TBs of data every day
• Serve more than a million data-services API requests per day
• Ensure business continuity as reliance on MyDSP keeps increasing
• Store terabytes of historical data
• Mesh transactional (online and offline) data with consumer behavior and derive analytics
• Build a flexible data-ingestion platform to manage many data feeds from multiple data sources
5. Flow of an Event
6. Event Data a.k.a. Logs
• Event Data -> a set of chronologically sequenced data records that capture information about an event!
• Virtually every form of system produces event data.
– Capture it from all components, and from both client-side and server-side events!
• You can think of logs as the footprint generated by any activity within the system/app.
• Event Data has different characteristics from data stored in traditional data warehouses:
– Huge Volume: Event data accumulates rapidly and often must be stored for years; many organizations are managing hundreds of terabytes, and some are managing petabytes.
– Format: Because of the huge variety of sources, event data is unstructured and semi-structured.
– Velocity: New event data is constantly coming in.
– Collection: Event data is difficult to collect because of broadly dispersed systems and networks.
– Time-stamped: Event data is always inserted once with a time-stamp. It never changes.
7. Log Analysis
• Logs are one of the most useful things when it comes to analysis; in simple terms, log analysis is making sense of system/app-generated log messages (or just LOGS). Through logs we get insight into what is happening inside the system.
• Helps root-cause analysis after any incident.
• Personalize the user experience by analyzing web usage data.
"Security Requirements":
• Traditionally there are also some compliance requirements: Log Management / SEM + SIM => SIEM.
• For data security: have one centralized platform for collecting ALL events (logs), correlate them, and have real-time intelligent visibility.
• Monitor not just the network, OS, devices etc., but ALL applications and business processes too.
8. Why Centralized Logging … for systems and applications!
• The need for centralized logging is quite important nowadays due to:
– growth in the number of applications,
– distributed architecture (Service Oriented Architecture),
– cloud-based apps,
– the number of machines and the infrastructure size increasing day by day.
• This means that centralized logging, and the ability to spot errors in distributed systems and applications, has become even more "valuable" and "needed".
And most importantly:
– to be able to understand customers and how they interact with websites;
– Understanding change: whether using A/B or multivariate experiments, or tweaking / understanding new implementations.
9. Capturing Events: Why structured data emitted from apps for machines is a better approach!
• Need for standardization:
– Developers assume that the first-level consumer of a log message is a human, and only they know what information is needed to debug an issue.
Logs are not just for humans!
The primary consumers of logs are shifting from humans to computers. This means log formats should have a well-defined structure that can be parsed easily and robustly.
Logs change!
If the logs never changed, writing a custom parser might not be too terrible. The engineer would write it once and be done. But in reality, logs change. Every time you add a feature, you start logging more data, and as you add more data, the printf-style format inevitably changes. This implies that the custom parser has to be updated constantly, consuming valuable development time.
• Suggested Approach: "Logging in JSON Format"
– To keep it simple and generic for any application, the recommended approach is {Key: Value}, i.e. a JSON log format (structured/semi-structured).
– This approach makes parsing and consumption easy, irrespective of whatever technology/tools we choose to use! (A minimal sketch follows.)
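As a minimal sketch of this recommendation (my illustration, not from the original deck): emitting each application event as one JSON object per line, using only Python's standard library. The field names mirror the sample event on slide 17; everything else (the facility value, the logger wiring) is an assumption.

import json
import logging
import sys
import uuid
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    # Render every log record as a single JSON line for machine consumption.
    def format(self, record):
        event = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "severity": record.levelname,
            "facility": "serverSide",  # illustrative field, mirroring slide 17
            "short_message": record.getMessage(),
            "uuid": getattr(record, "uuid", str(uuid.uuid4())),
        }
        return json.dumps(event)

logger = logging.getLogger("app")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# One parseable event per line; a collector (Logstash, Flume, ...) can ship it as-is.
logger.info("interstitial display banner", extra={"uuid": "24617072-3124-5544-2f61-695256432432.1379399183414528"})

Any downstream tool can then consume these lines with a generic JSON parser instead of a hand-written format parser.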
10. Key things to keep in mind / Rules
• Use timestamps for every event.
• Use unique identifiers (IDs) like Transaction ID / User ID / Session ID, or append a universally unique identifier (UUID) to track unique users.
• Log in text format, i.e. avoid logging binary information!
• Log anything that can add value when aggregated, charted, or further analyzed.
• Use categories, e.g. a "severity" field with values INFO, WARN, ERROR, and DEBUG.
• The 80/20 rule: 80% of our goals can be achieved with 20% of the work, so don't log too much.
• Keep the date/time and timezone NTP-synced on every producer and collector machine (#ntpdate ntp.example.com).
• Reliability: like video recordings, you don't want to lose the most valuable shot, so you record every frame and later, during analysis, throw away the rest, picking your best shot/frame. Here too, logs as events should be recorded with proper reliability, so that you don't lose any important and usable part of them, like that important video frame.
• Correlation rules for the various event streams, to generate and minimize alerts/events (a small sketch follows this list).
• Write connectors for integrations.
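One way such a correlation rule might look (a toy sketch under my own assumptions; the deck names no specific engine): collapse a burst of ERROR events sharing a session ID into a single alert, so downstream consumers see one event instead of many. The threshold and field names are arbitrary choices.

from collections import defaultdict

ERROR_THRESHOLD = 5  # assumption: alert once 5 errors share one session

def correlate(events):
    # Group ERROR events by sessionID and emit one alert per noisy session.
    errors_by_session = defaultdict(list)
    for event in events:
        if event.get("severity") == "ERROR":
            errors_by_session[event.get("sessionID")].append(event)
    alerts = []
    for session_id, errs in errors_by_session.items():
        if len(errs) >= ERROR_THRESHOLD:
            alerts.append({
                "sessionID": session_id,
                "count": len(errs),
                "first_timestamp": min(e["timestamp"] for e in errs),
            })
    return alerts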
11. Data Service Platform: DSP
Why do we need a data services platform?
– As an integration layer, to bring data from more sources in less time
– To serve various components: applications, and also monitoring systems etc.
12. Inputs : Data – what data to include
• Clickstream / Web Usage Data
– User Activity Logs
• Transactional Data Store
• Off-line
– CRM
– Email Behavior -> Logs/ Events
13. Top Architecture Considerations
• Non-blocking data ingestion
• UUID-tagged events / messages
• Load-balanced data processing across data centers
• Use of memory-based data storage for real-time data systems
• Easily scalable, HA (highly available), easy-to-maintain large historical data sets
• Data caching to achieve low latency (a cache-aside sketch follows this list)
• To ensure business continuity, parallel processing between two different data centers
• Use of a centralized service cloud for API management, security (authentication, authorization), metering and integration
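To make the caching point concrete (my sketch, not the deck's code), here is a cache-aside lookup using Redis, which appears in the tools arsenal later; the key scheme, TTL, and fetch_from_store callback are assumptions.

import json
import redis  # assumes the redis-py client is installed

r = redis.Redis(host="localhost", port=6379)

def get_user_profile(user_uuid, fetch_from_store):
    # Cache-aside: serve from Redis if present, else fetch and cache with a TTL.
    key = "profile:" + user_uuid            # hypothetical key scheme
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    profile = fetch_from_store(user_uuid)   # hits the backing NoSQL store (e.g. Couchbase)
    r.setex(key, 300, json.dumps(profile))  # 5-minute TTL, an arbitrary choice
    return profile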
14. Top-level key tasks for User Activity Logging & Analysis
1. Data collection of both client-side and server-side user activity streams
• Tag every website visitor with a UUID, similar to the system UUIDs
• Collect the activity streams on the BigData platform for analysis, through Kafka queues & NoSQL data stores (a producer sketch follows this list)
2. Near-real-time data processing
• Preprocessing / aggregations
• Filtering etc.
• Pattern discovery, along with the already available cooked data from point 4
• Clustering / classification / association discovery / sequence mining
3. Rule engine / recommendation algorithms
• Rule engine: building an effective business rule engine / correlating events
• Content-based filtering / collaborative filtering
4. Batch processing / post-processing using the Hadoop ecosystem
• Analysis & storing cooked data in a NoSQL data store
5. Data services (web services)
• RESTful APIs to make the data/insights consumable through various data services
6. Reporting/search interface & visualization for product development teams and other business owners.
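A minimal sketch of the collection path in step 1 (my illustration; the broker address and topic name are invented), publishing one JSON activity event to Kafka with the kafka-python client:

import json
from kafka import KafkaProducer  # assumes the kafka-python client is installed

producer = KafkaProducer(
    bootstrap_servers="kafka.example.com:9092",  # hypothetical broker
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

def publish_activity(event):
    # Ship one UUID-tagged activity event to the clickstream topic.
    producer.send("clickstream-events", event)   # hypothetical topic name

publish_activity({
    "uuid": "24617072-3124-674f-4b72-675746562434.1381297617597249",
    "pagename": "funnel:example com:page1",
    "event1": "loading",
})
producer.flush()  # block until buffered events are delivered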
15. Data System
Let's store everything!
Query = function(data) (see the toy sketch after this list)
Layered Architecture:
• Every event is data!
• Precomputed views
• Batch Layer: Hadoop M/R
• Speed Layer: Storm NRT computation
• Serving Layer
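A toy illustration of "Query = function(data)" across these layers (mine, not the deck's): the serving layer answers a count query by merging the precomputed batch view with the speed layer's increments since the last batch run. The page path and numbers are made up.

# Hypothetical precomputed views: page -> visit count
batch_view = {"/page/request": 10000}  # produced by Hadoop M/R over all history
speed_view = {"/page/request": 42}     # produced by Storm over events since the last batch run

def query_visits(page):
    # Query = function(data): merge the batch view with near-real-time increments.
    return batch_view.get(page, 0) + speed_view.get(page, 0)

print(query_visits("/page/request"))  # -> 10042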
17. Clickstream / User Activities Capture: Data is -> "Events"
• Tag every website visitor with a UUID using an Apache module (done)
– https://github.com/piykumar/modified_mod_cookietrack
– Cookie: UUID like 24617072-3124-674f-4b72-675746562434.1381297617597249
• JSON messages like the following (a parsing sketch follows the sample):
{
"timestamp": "2012-12-14T02:30:18",
"facility": "clientSide",
"clientip": "123.123.123.123",
"uuid": "24617072-3124-5544-2f61-695256432432.1379399183414528",
"domain": "www.example.com",
"server": "abc-123",
"request": "/page/request",
"pagename": "funnel:example com:page1",
"searchKey": "1234567890_",
"sessionID": "11111111111111",
"event1": "loading",
"event2": "interstitial display banner",
"severity": "WARN",
"short_message": "....meaning short message for aggregation...",
"full_message": "full LOG message",
"userAgent": "...blah...blah..blah...",
"RT": 2
}
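To show how readily such a structured event is consumed downstream (a sketch of mine, not part of the deck), a few lines of Python parse the event and address fields by name, with no printf-format parser to maintain:

import json

raw = '{"timestamp": "2012-12-14T02:30:18", "uuid": "24617072-3124-5544-2f61-695256432432.1379399183414528", "pagename": "funnel:example com:page1", "RT": 2}'
event = json.loads(raw)

# Fields are addressed by name; new fields can be added without breaking this consumer.
print(event["uuid"], event["pagename"], event["RT"])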
18. Tools Arsenal
• ETL: Talend
• BI: SpagoBI & QlikView
• Hadoop: Hortonworks
• NRT Computation: Twitter Storm
• Document-Oriented NoSQL DB: Couchbase
• Distributed Search: ElasticSearch
• Log Collection: Flume, Logstash, Syslog-NG
• Distributed messaging system: Kafka, RabbitMQ
• NoSQL: Cassandra, Redis, Neo4j (Graph)
• API Management: WSO2 API Manager, 3scale / Nginx
• Programming Languages: Java, Python, R
19. API Management and Data Services Cloud
• 3scale / Nginx, WSO2 API Manager, etc.
– Provide a centralized, distributed repository to serve APIs, with throttling, metering, security features, etc.
• Make building a data services layer part of the culture, and make sure that whatever components you create can be chained into the pipeline or called independently.
20. Thanks!
Questions, if any!
~/Piyush
@piykumar
http://piyush.me