Javantura v3 - Logs – the missing gold mine – Franjo Žilić

•

0 likes•1,417 views

HUJAK - Hrvatska udruga Java korisnika / Croatian Java User Association

This document discusses using ELK (Elasticsearch, Logstash, Kibana) to gain insights from logs. It describes the components of ELK - Elasticsearch as the database, Kibana as the UI, and Logstash to parse logs. Logstash can use GROK patterns to parse logs into structured data for analysis in Kibana. The document provides examples of using ELK to track web traffic, user activity, and API responses for benefits like reducing costs and monitoring performance. While custom solutions can be built, ELK is beneficial as it requires no software costs and allows full control over collected log data.

LOGS
THE MISSING GOLD MINE
by Franjo Žilić

WHY TALK ABOUT LOGS?
They are all around us
We use them to debug our software every day
Is that all tere is?

CONVENTIONAL WAY
grep -v " 200 " access_log-20160124

ELK
elasticsearch
logstash*
kibana
* or something else

ELASTICSEARCH
The database
document oriented
distributed
sharded
replicated*
timestamp partitioning*
Java & lucene

LOGSTASH
The parser
uses GROK
plugins for nearly everything

PARSING LOGS
filter {
grok {
match => { "message" => "%{COMBINEDAPACHELOG}" }
}
date {
match => [ "timestamp" , "dd/MMM/yyyy:HH:mm:ss Z" ]
}
}

PARSING LOGS
^(?#regex designed to parse VyOS kernel log)(?#some global parsing,
like timestamp, fitlter, interfaces, and so on)(?<time>[^ ]* [^ ]* [
^ ]*) (?<host>[^ ]*) (?<vyos_sylog_facility>[^: ]*)?: [(?<vyos_fw_f
ilter_name>[^[]*)] ?IN=( |(?<vyos_in_interface>[^ ]*) )OUT=( |(?<v
yos_out_interface>[^ ]*) )(MAC=( |(?<vyos_mac_addres>[^ ]*) ))?SRC=(
|(?<vyos_source_ip_address>[^ ]*) )DST=( |(?<vyos_dstination_ip_add
ress>[^ ]*) )LEN=( |(?<vyos_len>[^ ]*) )TOS=( |(?<vyos_tos>[^ ]*) )P
REC=( |(?<vyos_prec>[^ ]*) )TTL=( |(?<vyos_ttl>[^ ]*) )ID=( |(?<vyps
_packet_id>[^ ]*) )(?<vyos_packet_flags>[^ |(PROTO)]*)? ?PROTO=( |(?
<vyps_ip_protocol>[^ ]*))(?#here comes the fun part, different parse
r for different interesting packet types, regex if and positive look
behind matching each type of interesting )(?:(?<=(TCP))((?#tcp speci
fic matchers) ?SPT=( |(?<vyos_source_port>[^ ]*) )DPT=( |(?<vyos_des
tination_port>[^ ]*) )WINDOW=( |(?<vyos_tcp_window>[^ ]*) )RES=( |(?
<vyos_res>[^ ]*) )(?<vyps_tcp_state>[^(URGP)]* ).*)|(?:(?<=(UDP))((?
#udp specific matchers) ?SPT=( |(?<vyos_source_port>[^ ]*) )DPT=( |(
?<vyos_destination_port>[^ ]*) ).*)|(?:(?<=(ICMP))((?#icmp specific
matchers) TYPE=( |(?<vyos_icmp_type>[^ ]*) )CODE=( |(?<vyos_icmp_cod
e>[^ ] )).*)|(.*))))$

CUSTOM SOLUTION?
Implement custom data collection within the application
Populate data with Servlet filters or Spring AOP
Index data in Elasticsearch

BENEFITS
Logging that meets your needs
Ability to extract analytical data
Near real-time event tracking

REAL WORLD EXAMPLES
Tracking HTTP traffic
Tracking user activity
Tracking 3rd party API responses

HTTP DATA
Bandwith costs money
Web site scrapers are common
Serving non-compressed data is expensive
Identification of scrapers can reduce cost

3RD PARTY API TRACKING
Log all requests and responses
Monitor performance
Monitor availability
Provide extra troubleshooting data

BENEFITS OVER COMMERTIAL
SOLUTIONS
no software cost
on site or cloud install
full control over data
fully customizable

DRAWBACKS
additional setup
nothing is predefined
no "enterprise" support

CONCLUSIONS
Utilize data that you already have
learn more abiout your users and applications

THANK YOU
elastic.co
Fluentd
Apache Flume

Serverless is a hot topic in the software architecture world and also one of the points of contention. Serverless let us run our code without provisioning or managing servers. We don't have to think about servers at all. Things like elasticity or resilience might not longer be our problem anymore. On the other hand, we have to embrace a little bit different approach how to design our applications. We also have to give up a lot of control we might want and the most importantly we have to use technology which just might not be ready. In this talk, I’d like to discuss if it is worth to use serverless in our applications, what are the advantages and disadvantages of this approach. Secondly, I'd like to describe various use cases we were considering serverless and what was the result. And finally, I’d like to talk about how Scala fits into this. This talk should be interesting for everyone who is considering using serverless or just heard this word somewhere and would like to learn more. The talk is a little bit more focused on AWS but the understanding of the concepts I’m going to talk about should be beneficial even if you prefer a different service provider.

Strava Labs: Exploring a Billion Activity Dataset from Athletes with Apache S...

Databricks

At Strava we have extensively leveraged Apache Spark to explore our data of over a billion activities, from tens of millions of athletes. This talk will be a survey of the more unique and exciting applications: A Global Heatmap gives a ~2 meter resolution density map of one billion runs, rides, and other activities consisting of three trillion GPS points from 17 billion miles of exercise data. The heatmap was rewritten from a non-scalable system into a highly scalable Spark job enabling great gains in speed, cost, and quality. Locally sensitive hashing for GPS traces was used to efficiently cluster 1 billion activities. Additional processes categorize and extract data from each cluster, such as names and statistics. Clustering gives an automated process to extract worldwide geographical patterns of athletes. Applications include route discovery, recommendation systems, and detection of events and races. A coarse spatiotemporal index of all activity data is stored in Apache Cassandra. Spark streaming jobs maintain this index and compute all space-time intersections (“flybys”) of activities in this index. Intersecting activity pairs are then checked for spatiotemporal correlation, indicated by connected components in the graph of highly correlated pairs form “Group Activities”, creating a social graph of shared activities and workout partners. Data from several hundred thousand runners was used to build an improved model of the relationship between running difficulty and elevation gradient (Grade Adjusted Pace).

Cassandra as event sourced journal for big data analytics

Anirvan Chakraborty

Avoiding destructive updates and keeping history of data using event sourcing approaches has large advantages for data analytics. This talk describes how Cassandra can be used as event journal as part of CQRS/Lambda Architecture using event sourcing and further used for data mining and machine learning purposes in a big data pipeline. All the principles are demonstrated on an application called Muvr that we built. It uses data from wearable devices such as accelerometer in a watch or heartbeat monitor to classify user's exercises in near real time. It uses mobile devices and clustered Akka actor framework to distribute computation and then stores events as immutable facts in journal backed by Cassandra. The data are then read by Apache Spark and used for more expensive analytics and machine learning tasks such as suggests improvements to user's exercise routine or improves machine learning models for better real time exercise classification that can be used immediately. The talk mentions some of the internals of Spark when working with Cassandra and focuses on its machine learning capabilities enabled by Cassandra. A lot of the analytics are done for each user individually so the whole pipeline must handle potentially large amount of concurrent users and a lot of raw data so we need to ensure attributes such as responsiveness, elasticity and resilience.

Flink and Kafka are popular components to build an open source stream processing infrastructure. We present how Flink integrates with Kafka to provide a platform with a unique feature set that matches the challenging requirements of advanced stream processing applications. In particular, we will dive into the following points: Flink’s support for event-time processing, how it handles out-of-order streams, and how it can perform analytics on historical and real-time streams served from Kafka’s persistent log using the same code. We present Flink’s windowing mechanism that supports time-, count- and session- based windows, and intermixing event and processing time semantics in one program. How Flink’s checkpointing mechanism integrates with Kafka for fault-tolerance, for consistent stateful applications with exactly-once semantics. We will discuss “”Savepoints””, which allows users to save the state of the streaming program at any point in time. Together with a durable event log like Kafka, savepoints allow users to pause/resume streaming programs, go back to prior states, or switch to different versions of the program, while preserving exactly-once semantics. We explain the techniques behind the combination of low-latency and high throughput streaming, and how latency/throughput trade-off can configured. We will give an outlook on current developments for streaming analytics, such as streaming SQL and complex event processing.

Reactive programming on Android

Tomáš Kypta

Test strategies for data processing pipelines

Lars Albertsson

This talk will present recommended patterns and corresponding anti-patterns for testing data processing pipelines. We will suggest technology and architecture to improve testability, both for batch and streaming processing pipelines. We will primarily focus on testing for the purpose of development productivity and product iteration speed, but briefly also cover data quality testing. Presented at highloadstrategy.com 2016 by Lars Albertsson (independent, www.mapflat.com), joint work with Øyvind Løkling (Schibsted Products & Technology).

Lambda Architecture Using SQL

SATOSHI TAGOMORI

Airstream: Spark Streaming At Airbnb

Jen Aman

ksqlDB - Stream Processing simplified!

Guido Schmutz

ksqlDB is a stream processing SQL engine, which allows stream processing on top of Apache Kafka. ksqlDB is based on Kafka Stream and provides capabilities for consuming messages from Kafka, analysing these messages in near-realtime with a SQL like language and produce results again to a Kafka topic. By that, no single line of Java code has to be written and you can reuse your SQL knowhow. This lowers the bar for starting with stream processing significantly. ksqlDB offers powerful capabilities of stream processing, such as joins, aggregations, time windows and support for event time. In this talk I will present how KSQL integrates with the Kafka ecosystem and demonstrate how easy it is to implement a solution using ksqlDB for most part. This will be done in a live demo on a fictitious IoT sample.

OrientDB - Time Series and Event Sequences - Codemotion Milan 2014

Luigi Dell'Aquila

Stateful Distributed Stream Processing

Gyula Fóra

More complex streaming applications generally need to store some state of the running computations in a fault-tolerant manner. This talk discusses the concept of operator state and compares state management in current stream processing frameworks such as Apache Flink Streaming, Apache Spark Streaming, Apache Storm and Apache Samza. We will go over the recent changes in Flink streaming that introduce a unique set of tools to manage state in a scalable, fault-tolerant way backed by a lightweight asynchronous checkpointing algorithm. Talk presented in the Apache Flink Bay Area Meetup group on 08/26/15

Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark

Databricks

Streaming ETL to Elastic with Apache Kafka and KSQL

confluent

Companies are recognizing the importance of a low-latency, scalable, fault-tolerant data backbone, in the form of the Apache Kafka streaming platform. With Kafka, developers can integrate multiple sources and systems, enableing low latency analytics, event-driven architectures and the population of multiple downstream systems. These data pipelines can be built using configuration alone. In this talk we’ll see how easy it is to stream data from sources such as databases and into Kafka using the Kafka Connect API. We’ll use KSQL to filter, aggregate and join it to other data, and then stream this enriched data from Kafka out into targets such as Elasticsearch. All of this can be accomplished without a single line of code!

Flink Forward SF 2017: David Hardwick, Sean Hester & David Brelloch - Dynami...

Flink Forward

We have built a Flink-based system to allow our business users to configure processing rules on a Kafka stream dynamically. Additionally it allows the state to be built dynamically using replay of targeted messages from a long term storage system. This allows for new rules to deliver results based on prior data or to re-run existing rules that had breaking changes or a defect. Why we submitted this talk: We developed a unique solution that allows us to handle on the fly changes of business rules for stateful stream processing. This challenge required us to solve several problems -- data coming in from separate topics synchronized on a tracer-bullet, rebuilding state from events that are no longer on Kafka, and processing rule changes without interrupting the stream.

Tuning and Debugging in Apache Spark

Patrick Wendell

Video: https://www.youtube.com/watch?v=kkOG_aJ9KjQ This talk gives details about Spark internals and an explanation of the runtime behavior of a Spark application. It explains how high level user programs are compiled into physical execution plans in Spark. It then reviews common performance bottlenecks encountered by Spark users, along with tips for diagnosing performance problems in a production application.

It's Time To Stop Using Lambda Architecture | Yaroslav Tkachenko, Shopify

HostedbyConfluent

Lambda Architecture has been a common way to build data pipelines for a long time, despite difficulties in maintaining two complex systems. An alternative, Kappa Architecture, was proposed in 2014, but many companies are still reluctant to switch to Kappa. And there is a reason for that: even though Kappa generally provides a simpler design and similar or lower latency, there are a lot of practical challenges in areas like exactly-once delivery, late-arriving data, historical backfill and reprocessing. In this talk, I want to show how you can solve those challenges by embracing Apache Kafka as a foundation of your data pipeline and leveraging modern stream-processing frameworks like Apache Kafka Streams and Apache Flink.

Apache Beam: A unified model for batch and stream processing data

DataWorks Summit/Hadoop Summit

Spark Summit - Stratio Streaming

Stratio

NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis

Helena Edelson

The State of Stream Processing

confluent

Speaker: Neil Avery, Technologist, Office of the CTO, Confluent Stream processing is now at the forefront of many company strategies. Over the last couple of years we have seen streaming use cases explode and now proliferate the landscape of any modern business. Use cases including digital transformation, IoT, real-time risk, payments microservices and machine learning are all built on the fundamental that they need fast data and they need it at scale. Apache Kafka® has long been the streaming platform of choice, its origins of being dumb pipes for big data have long since been left behind and now it is the goto-streaming platform of choice. Stream processing beckons as being the vehicle for driving those streams, and along with it brings a world of real-time semantics surrounding windowing, joining, correctness, elasticity, and accessibility. The ‘current state of stream processing’ walks through the origins of stream processing, applicable use cases and then dives into the challenges currently facing the world of stream processing as it drives the next data revolution. Neil is a Technologist in the Office of the CTO at Confluent, the company founded by the creators of Apache Kafka. He has over 20 years of expertise of working on distributed computing, messaging and stream processing. He has built or redesigned commercial messaging platforms, distributed caching products as well as developed large scale bespoke systems for tier-1 banks. After a period at ThoughtWorks, he went on to build some of the first distributed risk engines in financial services. In 2008 he launched a startup that specialised in distributed data analytics and visualization. Prior to joining Confluent he was the CTO at a fintech consultancy. Watch the recording: https://videos.confluent.io/watch/rmU6GHrd4EKFaZrRhdTE3s?.

Distributed Stream Processing - Spark Summit East 2017

Petr Zapletal

The demand for stream processing is increasing a lot these days. Immense amounts of data have to be processed fast from a rapidly growing set of disparate data sources. This pushes the limits of traditional data processing infrastructures. These stream-based applications include trading, social networks, Internet of things, system monitoring, and many other examples. A number of powerful, easy-to-use open source platforms have emerged to address this. But the same problem can be solved differently, various but sometimes overlapping use-cases can be targeted or different vocabularies for similar concepts can be used. This may lead to confusion, longer development time or costly wrong decisions.

Javantura v3 - Rational Team Concert – integrated agile development and colla...

HUJAK - Hrvatska udruga Java korisnika / Croatian Java User Association

Javantura v3 - ES6 – Future Is Now – Nenad Pečanac

HUJAK - Hrvatska udruga Java korisnika / Croatian Java User Association

What's hot

Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust

Spark Summit

Scalable real-time processing techniques

Lars Albertsson

Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen

confluent

Reactive programming on Android

Tomáš Kypta

Test strategies for data processing pipelines

Lars Albertsson

Lambda Architecture Using SQL

SATOSHI TAGOMORI

Airstream: Spark Streaming At Airbnb

Jen Aman

ksqlDB - Stream Processing simplified!

Guido Schmutz

OrientDB - Time Series and Event Sequences - Codemotion Milan 2014

Luigi Dell'Aquila

Stateful Distributed Stream Processing

Gyula Fóra

Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark

Databricks

Streaming ETL to Elastic with Apache Kafka and KSQL

confluent

Flink Forward SF 2017: David Hardwick, Sean Hester & David Brelloch - Dynami...

Flink Forward

Tuning and Debugging in Apache Spark

Patrick Wendell

It's Time To Stop Using Lambda Architecture | Yaroslav Tkachenko, Shopify

HostedbyConfluent

Apache Beam: A unified model for batch and stream processing data

DataWorks Summit/Hadoop Summit

Spark Summit - Stratio Streaming

Stratio

NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis

Helena Edelson

The State of Stream Processing

confluent

Distributed Stream Processing - Spark Summit East 2017

Petr Zapletal

What's hot (20)

Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust

Scalable real-time processing techniques

Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen

Reactive programming on Android

Test strategies for data processing pipelines

Lambda Architecture Using SQL

Airstream: Spark Streaming At Airbnb

ksqlDB - Stream Processing simplified!

OrientDB - Time Series and Event Sequences - Codemotion Milan 2014

Stateful Distributed Stream Processing

Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark

Streaming ETL to Elastic with Apache Kafka and KSQL

Flink Forward SF 2017: David Hardwick, Sean Hester & David Brelloch - Dynami...

Tuning and Debugging in Apache Spark

It's Time To Stop Using Lambda Architecture | Yaroslav Tkachenko, Shopify

Apache Beam: A unified model for batch and stream processing data

Spark Summit - Stratio Streaming

NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis

The State of Stream Processing

Distributed Stream Processing - Spark Summit East 2017

Viewers also liked

Javantura v3 - Rational Team Concert – integrated agile development and colla...

HUJAK - Hrvatska udruga Java korisnika / Croatian Java User Association

Javantura v3 - ES6 – Future Is Now – Nenad Pečanac

HUJAK - Hrvatska udruga Java korisnika / Croatian Java User Association

Javantura v3 - Conquer the Internet of Things with Java and Docker – Johan Ja...

HUJAK - Hrvatska udruga Java korisnika / Croatian Java User Association

Javantura v3 - CQRS – another view on application architecture – Aleksandar S...

HUJAK - Hrvatska udruga Java korisnika / Croatian Java User Association

Javantura v3 - Java & JWT Stateless authentication – Karlo Novak

HUJAK - Hrvatska udruga Java korisnika / Croatian Java User Association

Javantura v3 - Going Reactive with RxJava – Hrvoje Crnjak

HUJAK - Hrvatska udruga Java korisnika / Croatian Java User Association

Javantura v3 - FIWARE – from ideas to real projects – Krunoslav Hrnjak

HUJAK - Hrvatska udruga Java korisnika / Croatian Java User Association

Javantura v3 - Just say it – using language to communicate with the computer ...

HUJAK - Hrvatska udruga Java korisnika / Croatian Java User Association

Javantura v3 - The Internet of (Lego) Trains – Johan Janssen, Ingmar van der ...

HUJAK - Hrvatska udruga Java korisnika / Croatian Java User Association

Javantura v3 - Apache Spark revolution – what’s it all about – Petar Zečević

HUJAK - Hrvatska udruga Java korisnika / Croatian Java User Association

Javantura v3 - Husky – (y)our tool for tracking value in data – Mladen Marovi...

HUJAK - Hrvatska udruga Java korisnika / Croatian Java User Association

Javantura v3 - Microservice – no fluff the REAL stuff – Nakul Mishra

HUJAK - Hrvatska udruga Java korisnika / Croatian Java User Association

Javantura v3 - What really motivates developers – Ivan Krnić

HUJAK - Hrvatska udruga Java korisnika / Croatian Java User Association

Javantura v3 - Real-time BigData ingestion and querying of aggregated data – ...

HUJAK - Hrvatska udruga Java korisnika / Croatian Java User Association

Javantura v4 - DMN – supplement your BPMN - Željko Šmaguc

HUJAK - Hrvatska udruga Java korisnika / Croatian Java User Association

Javantura v4 - CroDuke Indy and the Kingdom of Java Skills - Branko Mihaljevi...

HUJAK - Hrvatska udruga Java korisnika / Croatian Java User Association

Javantura v4 - Java or Scala – Web development with Playframework 2.5.x - Kre...

HUJAK - Hrvatska udruga Java korisnika / Croatian Java User Association

Javantura v4 - JVM++ The GraalVM - Martin Toshev

HUJAK - Hrvatska udruga Java korisnika / Croatian Java User Association

Javantura v4 - FreeMarker in Spring web - Marin Kalapać

HUJAK - Hrvatska udruga Java korisnika / Croatian Java User Association

Javantura v4 - Let me tell you a story why Scrum is not for you - Roko Roić

HUJAK - Hrvatska udruga Java korisnika / Croatian Java User Association

Viewers also liked (20)