Work on the on-board computer software for the Polish satellite PW-Sat2 began in October 2015. Maciek has been involved ever since, bridging the world of "boring business .NET web applications" with the world of electronics, C++ and space. In his talk he covered the technology, the differences and similarities between the two worlds, and the challenges, both terrestrial and extraterrestrial.
One day your boss walks in and tells you that, as of today, you stop testing web applications and start testing embedded software. There would be nothing dramatic about this, were it not for the fact that the software is embedded in nanosatellites. Questions naturally arise in a tester's mind about how much quality assurance in such an environment differs from testing "ground-based" applications and systems.
I will introduce you to the space domain using the example of software written for satellites, specifically for a sun sensor that will fly on one of them.
Topics covered:
– an introduction to how nanosatellites work
– the basics of embedded systems operating in space
– the space environment and its impact on software
– how testing a satellite differs from testing standard applications and from testing industrial embedded software
– how to test, what to test, tools, test types, standards
– what the automated test pyramid looks like
– limitations on testing
– the competencies of a satellite/embedded systems tester
– the results of the work and a personal retrospective
The Factorization Machines algorithm for building recommendation system - Paw...
Recommendation systems are among the most successful applications of data science in the Big Data domain. The goal of my talk is to present the Factorization Machines algorithm, available in the SAS Viya platform.
Factorization Machines are a good choice for making predictions and recommendations from large sparse data, which is typical of Big Data. In the practical part of the presentation, low-granularity data from the NBA league will be used to build an application that recommends optimal game strategies and predicts the results of league games.
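For reference, the model behind the algorithm is the second-order Factorization Machine of Rendle (2010), which scores a feature vector x as

    \hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle \, x_i x_j

where every feature i is assigned a low-dimensional latent vector v_i. Because pairwise interaction weights are factorized through these vectors, the model can estimate interactions even between feature pairs that never co-occur in the training data, which is exactly what makes it effective on large sparse data.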
A/B testing powered by Big data - Saurabh Goyal, Booking.com
At Booking we have more than a million properties selling their rooms to our customers. We receive approximately 1000 events per minute from them, which adds up to 500 GB of data for partner events alone.
To make sure we receive the relevant inventory from our partners, we A/B test various new features; in a single quarter there were more than 100 experiments focusing on availability alone.
In my talk I will cover A/B testing at Booking, the technologies, such as Hadoop, HBase, Cassandra and Kafka, that we use to store and process large volumes of data, and how we build up metrics to measure the success of our experiments.
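The abstract does not name a particular test statistic; one common way to score such experiments is a two-proportion z-test on conversion rates. A minimal sketch in Python, with made-up numbers:

    from math import sqrt, erfc

    # two-sided z-test for the difference between two conversion rates
    def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
        p_a, p_b = conv_a / n_a, conv_b / n_b
        p_pool = (conv_a + conv_b) / (n_a + n_b)        # pooled rate under H0
        se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
        z = (p_b - p_a) / se
        p_value = erfc(abs(z) / sqrt(2))                # two-sided tail probability
        return z, p_value

    # hypothetical experiment: variant B against control A
    z, p = two_proportion_z_test(conv_a=4120, n_a=100000, conv_b=4340, n_b=100000)
    print("z = %.2f, p = %.4f" % (z, p))

At Booking's scale the interesting engineering problem is computing the inputs, the conversions and sample sizes per variant, from the raw event streams, which is where Hadoop and Kafka come in.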
Near Real-Time Fraud Detection in Telecommunication Industry - Burak Işıklı, ...
Fraud is a common pain point in the telecom sector, and detecting it is like finding a needle in a haystack due to the volume and velocity of the data. There are two key factors in detecting fraud:
(1) Speed: if you cannot detect fraud in time, you are doomed to lose, because the fraudsters have already got what they need. Simbox detection is one use case here: fraudsters use simboxes to bypass interconnection fees. For this use case we talk about our real-time architecture, which uses Spark SQL to detect simboxes within 5 minutes.
(2) Accuracy: fraudsters change their methods all the time, and our job is to pin down their behaviour accurately using machine learning algorithms. Anomaly detection is one use case here: we talk about our data mining architecture, which builds fraud models with Spark ML within 1 hour. We also discuss the performance of some ML algorithms on Spark, such as K-means, the three sigma rule and T-digest. To meet these requirements we process 8-10 billion records, 4-5 TB in size, every day. Our solution combines end-to-end ingestion, processing and mining of high-volume data to detect several fraud use cases in near real time from CDR and IPTDR data, saving millions and improving the user experience.
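To make the three sigma rule above concrete: fit a mean and standard deviation on a baseline of presumed-normal behaviour, then flag new observations that deviate by more than three standard deviations. A minimal sketch with made-up call counts, not the production Spark pipeline:

    import numpy as np

    def is_anomalous(history, value, k=3.0):
        # baseline statistics come from past, presumed-normal behaviour
        mu, sigma = np.mean(history), np.std(history)
        return abs(value - mu) > k * sigma

    # hypothetical daily call counts for one subscriber
    history = [12, 9, 15, 11, 8, 14, 10, 13]
    print(is_anomalous(history, 11))    # False: an ordinary day
    print(is_anomalous(history, 450))   # True: a simbox-like burst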
Assisting millions of active users in real-time - Alexey Brodovshuk, Kcell; K...
Many companies nowadays are data-rich and data-intensive: they have millions of users generating billions of interactions and events per day.
These massive streams of complex events can be processed and reacted upon, for example to offer new products, trigger next best actions, communicate with users or detect fraud, and the quicker we can do it, the more value we can generate.
In this talk we will present how, in joint development with our client and in just a few months of effort, we built a complex event processing platform for their intensive data streams from the ground up. We will share how the system runs marketing campaigns and detects fraud by following the behaviour of millions of users in real time and reacting to it instantly. The platform, designed and built with Big Data technologies to scale indefinitely and cost-effectively, already ingests and processes billions of messages, or terabytes of data, per day on a still-small cluster. We will share how we leveraged best-of-breed open-source projects, including Apache Flink, Apache NiFi and Apache Kafka, as well as the interesting problems we had to solve along the way. Finally, we will share where we are heading next: which use cases we are going to implement, and how.
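The talk is about the platform as a whole, but the basic ingest-and-react loop underneath can be pictured with a few lines of the kafka-python client. The topic name, event schema and reaction below are all hypothetical:

    import json
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "user-events",                                  # hypothetical topic
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    )

    for record in consumer:
        event = record.value
        # in the real platform this dispatch runs inside Flink jobs;
        # here we just branch on a hypothetical event type
        if event.get("type") == "suspicious_login":
            print("fraud alert for user", event.get("user_id"))

The value of an engine like Flink over such a loop is state, parallelism and fault tolerance: correlating millions of users' event histories instead of reacting to one record at a time.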
Machine learning security - Pawel Zawistowski, Warsaw University of Technolog...
Despite the rapid progress of tools and methods, security has been almost entirely overlooked in mainstream machine learning. Unfortunately, even the most sophisticated and carefully crafted models can fall victim to so-called adversarial examples.
This talk will cover the concepts of adversarial data and machine learning security, go through examples of possible attack vectors and discuss the currently known defence mechanisms.
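To make "adversarial example" concrete: one canonical construction (not necessarily the one covered in the talk) is the fast gradient sign method of Goodfellow et al. Given an input x with label y, model parameters \theta and loss J, the attack takes one step in the direction that increases the loss most:

    x_{adv} = x + \varepsilon \cdot \mathrm{sign}\left( \nabla_x J(\theta, x, y) \right)

For small \varepsilon the perturbation is imperceptible to a human, yet it is often enough to flip the model's prediction.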
Building a Modern Data Pipeline: Lessons Learned - Saulius Valatka, Adform
Adform is one of the biggest European ad-tech companies – for example, our RTB engine at peak handles ~1m requests per second, each in under 100 ms, producing ~20TB of data daily.
In this talk I will present the data pipeline and the infrastructure behind it, emphasizing our core principles (such as event sourcing, immutability and correctness), the lessons learned while building it, and the state it is converging to.
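Event sourcing, the first of the principles named above, means the pipeline's source of truth is an immutable, append-only log of facts, and any state is derived by folding over it. A minimal sketch, with illustrative event types:

    from dataclasses import dataclass, field

    @dataclass(frozen=True)
    class Event:
        # an immutable fact; the log is only ever appended to
        name: str
        amount: int

    @dataclass
    class Account:
        log: list = field(default_factory=list)

        def append(self, event: Event) -> None:
            self.log.append(event)

        def state(self) -> int:
            # current balance is a pure fold over the full event history
            balance = 0
            for e in self.log:
                balance += e.amount if e.name == "deposit" else -e.amount
            return balance

    acct = Account()
    acct.append(Event("deposit", 100))
    acct.append(Event("withdraw", 30))
    print(acct.state())   # 70, derived on demand, never stored as mutable truth

Correctness falls out of the same idea: since the log is immutable, a buggy derivation can be fixed and simply replayed over the original events.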
Apache Flink: Better, Faster & Uncut - Piotr Nowojski, data Artisans
This talk will start with a brief introduction to stream processing and to Flink itself. Next, we will take a look at some of the most interesting recent improvements in Flink, such as incremental checkpointing, the end-to-end exactly-once processing guarantee, and network latency optimizations. We will discuss real problems that Flink's users were facing and how they were addressed by the community and data Artisans.
Privacy by Design - Lars Albertsson, Mapflat
Privacy and personal integrity have become a focus topic due to the upcoming GDPR deadline in May 2018 and its requirements for data storage, retention and access. This talk provides an engineering perspective on privacy and highlights pitfalls and topics that require early attention.
The content of the talk is based on real-world experience of handling privacy protection in large-scale data processing environments.
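One engineering topic worth a sketch here is pseudonymization: replacing personal identifiers with a keyed hash so datasets remain joinable but are not reversible without the key. Field names are illustrative, and in practice the key lives in a secrets store, not in code:

    import hmac
    import hashlib

    SECRET_KEY = b"load-me-from-a-vault"     # placeholder; never hard-code in reality

    def pseudonymize(user_id: str) -> str:
        # deterministic keyed hash: the same input maps to the same token,
        # so joins across datasets still work
        return hmac.new(SECRET_KEY, user_id.encode("utf-8"), hashlib.sha256).hexdigest()

    record = {"user_id": "alice@example.com", "country": "SE"}
    record["user_id"] = pseudonymize(record["user_id"])
    print(record)

Destroying the key later effectively anonymizes all historical data at once, which is one common engineering answer to retention and erasure requirements.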
Elephants in the cloud or how to become cloud ready - Krzysztof Adamski, GetI...
The way you operate your Big Data environment is not going to stay the same. This session is based on our experience managing on-premise environments and on lessons from innovative data-driven companies that have successfully migrated their multi-PB Hadoop clusters: where to start, and which decisions you have to make to gradually become cloud ready. The examples refer to Google Cloud Platform, yet the challenges are common.
Deriving Actionable Insights from High Volume Media Streams - Jörn Kottmann, ...
In this talk we describe how to analyze high volumes of real-time streams of news feeds, social media and blogs in a scalable and distributed way, using Apache Flink together with Natural Language Processing tools such as Apache OpenNLP to perform common NLP tasks like Named Entity Recognition (NER), chunking and text classification.
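The talk uses Apache OpenNLP on the JVM; to show what NER output looks like, here is an equivalent sketch with spaCy instead (a deliberate substitution so the example stays in Python; it assumes the en_core_web_sm model is installed):

    import spacy

    nlp = spacy.load("en_core_web_sm")        # small pretrained English pipeline

    doc = nlp("Apache Flink was presented in Warsaw by a team from Germany.")
    for ent in doc.ents:
        # each entity carries its text span and a type label
        print(ent.text, ent.label_)           # e.g. "Warsaw GPE", "Germany GPE"

In the streaming setting described above, a call like this would run inside a map operator applied to each incoming document.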
Enhancing Spark - increase streaming capabilities of your applications - Kami...
During this session we will discuss the pros and cons of the new structured streaming data processing model in Spark, and a nifty way of enhancing Spark with SnappyData, an open-source framework providing great features for both persistent and in-motion data analysis.
Based on a real-life use case, in which we designed and implemented a streaming application that filters, consumes and aggregates tons of events, we will discuss the role of the persistent back-end and of stream processing integration in real-time applications, in terms of the performance, robustness and scalability of the solution.
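For readers new to the structured streaming model, a minimal PySpark job looks roughly like this (a sketch without SnappyData; the Kafka topic and servers are placeholders):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import window, col

    spark = SparkSession.builder.appName("events-per-minute").getOrCreate()

    # read an unbounded stream of events from Kafka
    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "localhost:9092")
              .option("subscribe", "events")
              .load())

    # count records per 1-minute window of the Kafka record timestamp
    counts = events.groupBy(window(col("timestamp"), "1 minute")).count()

    # continuously emit the updated aggregates to the console
    query = (counts.writeStream
             .outputMode("complete")
             .format("console")
             .start())
    query.awaitTermination()

The session then looks at where this model reaches its limits and how SnappyData complements it for combined persistent and in-motion analysis.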
7 Days of Playing Minesweeper, or How to Shut Down Whistleblower Defense with...
The next time you find yourself thinking there isn’t enough time in a week, consider what Drinker Biddle did for their client in 7 days.
When a senior executive at a publicly traded company was fired for underperformance, he made a serious allegation on his way out the door: he claimed he was laid off because of his repeated attempts to inform officials that the company was falsifying the quarterly financial reports it published. Instead of waiting for the typical pace of discovery, which could have cost their client at least a quarter of a million dollars, Drinker Biddle used powerful analytics technology to conduct an intelligent investigation, fast. In this session you will learn about the machine learning that makes digging through large multi-source data sets possible, and you will see behind the scenes of how engineers empower legal teams to organize data, discover the truth and act on it.
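The abstract stays high-level about the machine learning; investigations like this are often framed as supervised text classification, where reviewers label a seed set and a model ranks the rest. A toy scikit-learn sketch with made-up documents and labels:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # tiny made-up seed set: 1 = relevant to the investigation, 0 = not
    docs = [
        "quarterly report adjusted before filing",
        "team lunch scheduled for friday",
        "revenue numbers restated per request",
        "printer on floor three is out of toner",
    ]
    labels = [1, 0, 1, 0]

    # TF-IDF features plus logistic regression: a standard review baseline
    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model.fit(docs, labels)

    # score an unreviewed document by predicted relevance
    print(model.predict_proba(["please adjust the quarterly numbers"])[0][1])

Ranking millions of documents by such a relevance score is how a review can shrink from months to days.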
Big Data Journey at a Big Corp - Tomasz Burzyński, Maciej Czyżowicz, Orange P...
We will present the journey of Orange Polska as it evolved from a proprietary ecosystem towards a largely open-source ecosystem based on Hadoop and friends, a journey particularly challenging at a large corporation. We will present the key drivers for starting with Big Data, the evolution of BI, and the emergence of Data Scientists and advanced analytics, along with operational reporting and stream processing to detect issues. The presentation will cover both the technical aspects and the business environment, as the two are inherently linked in the process of big data adoption in an enterprise.
Stream processing with Apache Flink - Maximilian Michels, data Artisans
Apache Flink is an open source platform for distributed stream and batch data processing. At its core, Flink is a streaming dataflow engine which provides data distribution, communication, and fault tolerance for distributed computations over data streams. On top of this core, APIs make it easy to develop distributed data analysis programs. Libraries for graph processing or machine learning provide convenient abstractions for solving large-scale problems. Apache Flink integrates with a multitude of other open source systems like Hadoop, databases, or message queues. Its streaming capabilities make it a perfect fit for traditional batch processing as well as state of the art stream processing.
Scaling Cassandra in all directions - Jimmy Mardell, Spotify
At Spotify we run over 100 Cassandra clusters, from small 3-node clusters to clusters with up to 100 nodes, many of them spanning multiple datacenters. I will talk about the challenges of having so many clusters and about the tools we use, and have built, to manage them. There will also be some war stories from the times we have failed.
Big Data for unstructured data - Dariusz Śliwa
The sources for Big Data are usually structured data, coming from other systems and from mechanisms that track customer interaction channels (or devices, in the case of M2M). But what about the enormous potential dormant in the vast resources of unstructured information? How do you extract business value and turn the (storage) cost of such data into a real company asset? Beyond the traditional Big Data analytics tools (HPE IDOL or Vertica), Hewlett Packard Enterprise offers technologies for unstructured information. The file classification and analytics provided by HPE ControlPoint make it easy to assess the quality of unstructured information and to quickly weed out useless data (redundant, obsolete, trivial and dark data). HPE Investigative Analytics combines data sources and analyses not only by means of behavioural models, but also complements the picture with Sentiment Analysis and Intent.
Elastic development. Implementing Big Data search - Grzegorz Kołpuć
A quick look at the implementation of search platforms based on Elasticsearch, from a developer's perspective. Full-text search, relevance, geolocation, stats, aggregations, alerting: I will show you how pleasant all of that can be, and what traps are waiting for you in the limbo of distributed systems.
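As a taste of the features listed above, here is a minimal full-text query with an aggregation using the official Python client (index and field names are illustrative; the keyword-argument style follows the newer client versions):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    resp = es.search(
        index="posts",
        query={"match": {"content": "distributed search"}},          # full-text relevance
        aggs={"by_author": {"terms": {"field": "author.keyword"}}},  # facet over a keyword field
        size=5,
    )

    for hit in resp["hits"]["hits"]:
        print(hit["_score"], hit["_source"].get("content"))
    for bucket in resp["aggregations"]["by_author"]["buckets"]:
        print(bucket["key"], bucket["doc_count"])

The traps the abstract alludes to start exactly here: relevance scoring, shard-local aggregation counts and mapping choices all behave differently once the index is distributed.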
H2O Deep Water: making deep learning accessible to everyone - Jo-fai Chow
Deep Water is H2O's integration with multiple open-source deep learning libraries such as TensorFlow, MXNet and Caffe. On top of the performance gains from GPU backends, Deep Water naturally inherits all of H2O's properties in scalability, ease of use and deployment. In this talk I will go through the motivation and benefits of Deep Water. After that, I will demonstrate how to build and deploy deep learning models, with or without programming experience, using H2O's R, Python and Flow (web) interfaces.
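For a feel of the Python interface mentioned above, here is a minimal H2O model-building sketch. Two hedges: it uses H2O's bundled deep learning estimator rather than the Deep Water backends themselves, and the dataset path is a placeholder for any H2OFrame with a categorical target:

    import h2o
    from h2o.estimators.deeplearning import H2ODeepLearningEstimator

    h2o.init()                                   # start or connect to a local H2O cluster

    frame = h2o.import_file("iris_wheader.csv")  # placeholder dataset with a class column
    predictors, target = frame.columns[:-1], frame.columns[-1]

    model = H2ODeepLearningEstimator(hidden=[32, 32], epochs=10)
    model.train(x=predictors, y=target, training_frame=frame)
    print(model.logloss())

Per the abstract, Deep Water keeps this workflow while swapping in TensorFlow, MXNet or Caffe as the backend, optionally on GPUs.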
That won’t fit into RAM - Michał Brzezicki
SentiOne is one of the leading solutions in Europe for social media listening and analysis. We monitor over 26 European markets, including CEE, Scandinavia, DACH and the Balkans. The amount of data that is processed every day and kept ready to be queried by our users is enormous. Over the years we have tested many technologies and approaches in big data, many of which have failed. The presentation covers our experiences and lessons learned in setting up a big data company from scratch. I will give details on configuring a robust Elasticsearch cluster with over 26 TB of data and describe the key challenges in efficient web crawling and data extraction.
Stream Analytics with SQL on Apache Flink - Fabian Hueske
SQL is undoubtedly the most widely used language for data analytics, for many good reasons. It is declarative, many database systems and query processors feature advanced query optimizers and highly efficient execution engines, and, last but not least, it is the standard that everybody knows and uses. With stream processing technology becoming mainstream, a question arises: “Why isn’t SQL widely supported by open source stream processors?”. One answer is that SQL’s semantics and syntax have not been designed with the characteristics of streaming data in mind. Consequently, systems that want to provide support for SQL on data streams have to overcome a conceptual gap. One approach is to support standard SQL, which is known by users and tools but comes at the cost of cumbersome workarounds for many common streaming computations. Other approaches are to design custom SQL-inspired stream analytics languages or to extend SQL with streaming-specific keywords. While such solutions tend to result in more intuitive syntax, they suffer from not being established standards and thereby exclude many users and tools.
Apache Flink is a distributed stream processing system with very good support for streaming analytics. Flink features two relational APIs, the Table API and SQL. The Table API is a language-integrated relational API with stream-specific features. Flink’s SQL interface implements the plain SQL standard. Both APIs are semantically compatible and share the same optimization and execution path based on Apache Calcite.
In this talk we present the future of Apache Flink’s relational APIs for stream analytics, discuss their conceptual model, and showcase their usage. The central concept of these APIs are dynamic tables. We explain how streams are converted into dynamic tables and vice versa without losing information due to the stream-table duality. Relational queries on dynamic tables behave similar to materialized view definitions and produce new dynamic tables. We show how dynamic tables are converted back into changelog streams or are written as materialized views to external systems, such as Apache Kafka or Apache Cassandra, and are updated in place with low latency. We conclude our talk demonstrating the power and expressiveness of Flink’s relational APIs by presenting how common stream analytics use cases can be realized.
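A small taste of these APIs via PyFlink: the datagen table below is a dynamic table that grows forever, and the windowed query over it is a continuous query producing a new dynamic table. The table, its fields and the legacy group-window syntax are illustrative:

    from pyflink.table import EnvironmentSettings, TableEnvironment

    t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

    # an unbounded synthetic source of click events
    t_env.execute_sql("""
        CREATE TABLE clicks (
            user_id INT,
            url STRING,
            ts AS PROCTIME()
        ) WITH (
            'connector' = 'datagen',
            'rows-per-second' = '5',
            'fields.user_id.min' = '1',
            'fields.user_id.max' = '3'
        )
    """)

    # clicks per user per 10-second tumbling window; the result is itself
    # a dynamic table that keeps growing as windows close
    result = t_env.sql_query("""
        SELECT user_id,
               COUNT(url) AS clicks,
               TUMBLE_END(ts, INTERVAL '10' SECOND) AS window_end
        FROM clicks
        GROUP BY user_id, TUMBLE(ts, INTERVAL '10' SECOND)
    """)
    result.execute().print()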
Hopsworks: Secure Streaming as-a-service with Kafka, Flink and Spark - Theofilos Kak...
Since June 2016, Kafka, Spark and Flink-as-a-service have been available to researchers and companies in Sweden from the Swedish ICT SICS Data Center at www.hops.site using the HopsWorks platform (www.hops.io). Flink and Spark applications are run within a project on a YARN cluster with the novel property that applications are metered and charged to projects. Projects are also securely isolated from each other and include support for project-specific Kafka topics; that is, Kafka topics are protected from access by users who are not members of the project. In this talk we will discuss the challenges in building multi-tenant streaming applications on YARN that are metered and easy to debug. We show how we use the ELK stack (Elasticsearch, Logstash, and Kibana) for logging and debugging running streaming applications, how we use Grafana and Graphite for monitoring streaming applications, and how users can debug and optimize terminated Spark Streaming jobs using Dr Elephant. We will also discuss the experiences of our users (over 120 users as of Oct 2016): how they manage their Kafka topics and quotas, patterns for how users share topics between projects, and our novel solutions for helping researchers debug and optimize Spark applications. Hopsworks is entirely UI-driven and carries an Apache v2 open source license.
ING CoreIntel - collect and process network logs across data centers in near ...
Security is at the core of every bank's activity. ING set itself an ambitious goal: to gain insight into overall network data activity, in order to quickly recognize and neutralize unwelcome guests such as malware and viruses, to prevent data leakage, and to track down misconfigured software components.
Since the inception of the CoreIntel project we knew we were going to face the challenges of capturing, storing and processing vast amounts of data of various types from all over the world. In our session we would like to share our experience in building a scalable, distributed system architecture based on Kafka, Spark Streaming, Hadoop and Elasticsearch to help us achieve these goals.
Why does choosing a good data format matter? How do you manage Kafka offsets? Why is dealing with Elasticsearch a love-hate relationship for us, and how did we manage to put all these pieces together?
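On the Kafka offsets question: the usual pattern is to disable auto-commit and commit an offset only after the record has been durably processed, so a crash replays data rather than losing it. A minimal kafka-python sketch with placeholder names:

    from kafka import KafkaConsumer

    def process(payload: bytes) -> None:
        # placeholder for real work, e.g. indexing the log line into Elasticsearch
        print(len(payload))

    consumer = KafkaConsumer(
        "network-logs",                      # placeholder topic
        group_id="coreintel",                # placeholder consumer group
        bootstrap_servers="localhost:9092",
        enable_auto_commit=False,            # we decide when an offset is done
    )

    for record in consumer:
        process(record.value)
        consumer.commit()                    # commit only after successful processing

This gives at-least-once delivery, which is one reason the downstream writes (into Elasticsearch, for instance) need to be idempotent.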
Real Time Data Processing at RTB House - Bartosz Łoś
Our platform, which purchases and runs advertisements in the Real-Time Bidding model, processes 250K bid requests and generates 20K events every second, which amounts to 3 TB of data every day. For machine learning, system monitoring and financial settlements we need to filter, store, aggregate and join these events together. As a result, processed events and aggregated statistics are available in Hadoop, Google BigQuery and Postgres. The most demanding business requirements are: events that should be joined together can appear up to 30 days apart, we are not allowed to create any duplicates, we have to minimize possible data losses, and there cannot be any differences between the generated data outputs. We have designed and implemented a solution which has reduced the delay in the availability of this data from 1 day to 15 seconds.
We will present: 1. our first approach to the problem (end-of-day batch jobs) and the final solution (real-time stream processing), 2. a detailed description of the current architecture, 3. how we tested the new data flow before it was deployed and how it is monitored now, 4. our one-click deployment process, 5. the decisions we made, with their advantages and disadvantages, and our future plans for improving the current solution.
We would like to share our experience of scaling the solution over clusters of computers in several data centers. We will focus on the current architecture, but also on testing and monitoring issues and on our deployment process. Finally, we would like to give an overview of the projects involved, such as Kafka, MirrorMaker, Storm, Aerospike, Flume and Docker: what we have achieved with each open source component, and the problems we have come across.
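Of the requirements above, the no-duplicates rule is typically enforced with keyed deduplication over a time-bounded store. A minimal in-memory sketch; a production version would keep the seen-set in a persistent store (such as the Aerospike mentioned above), since matching events can arrive up to 30 days apart:

    import time

    class Deduplicator:
        # remember event ids for ttl seconds and drop anything seen before
        def __init__(self, ttl: float):
            self.ttl = ttl
            self.seen = {}                    # event_id -> first-seen timestamp

        def accept(self, event_id: str) -> bool:
            now = time.time()
            # evict expired ids so memory stays bounded
            self.seen = {k: t for k, t in self.seen.items() if now - t < self.ttl}
            if event_id in self.seen:
                return False                  # duplicate: drop it
            self.seen[event_id] = now
            return True

    dedup = Deduplicator(ttl=30 * 24 * 3600)  # a 30-day window, as in the abstract
    for eid in ["imp-1", "imp-2", "imp-1"]:
        print(eid, "accepted" if dedup.accept(eid) else "dropped")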
Redundancy for Big Hadoop Clusters is hard - Stuart Pook
Criteo had a Hadoop cluster with 39 PB of raw storage, 13404 CPUs, 105 TB of RAM, 40 TB of data imported per day and over 100000 jobs per day. This cluster was critical for both storage and compute, yet had no backups. After much effort to increase our redundancy, we now have two clusters that, combined, have more than 2000 nodes, 130 PB, two different versions of Hadoop and 200000 jobs per day, but these clusters do not yet provide a redundant solution to all our storage and compute needs. This talk discusses the choices we made and the issues we solved in creating a 1200-node cluster with new hardware in a new data centre. Some of the challenges involved in running two different clusters in parallel will be presented. We will also analyse what went right (and wrong) in our attempt to achieve redundancy, and our plans to improve our capacity to handle the loss of a data centre.
Orchestrating Big Data pipelines @ Fandom - Krystian Mistrzak, Thejas Murthy
Fandom is the largest entertainment fan site in the world. With more than 360,000 fan communities and a global audience of over 190 million monthly uniques, we are the fan's voice in entertainment. Being the largest entertainment site, Fandom generates massive volumes of data, ranging over clickstream, user activities, API requests, ad delivery, A/B testing and much more. The big challenge is not just the volume but the orchestration involved in combining various sources of data with varying periodicity and volumes, and making sure the processed data is available to its consumers within the expected time, thereby delivering the right insights well within the right time. A conscious decision was made to choose the right open source tool to solve the orchestration problem; after evaluating various tools, we decided on Apache Airflow. This presentation will give an overview of the comparison of existing tools and emphasize why we chose Airflow, and show how Airflow is being used to create a stable, reliable orchestration platform that lets non data engineers access data seamlessly, democratizing our data. We will focus on some tricks and best practices for developing workflows with Airflow and show how we use some of its features, starting from the minimal example sketched below.
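For readers who have not seen Airflow, a minimal DAG looks roughly like this (names and schedule are illustrative; the imports follow the classic pre-2.0 API):

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    # a tiny daily pipeline: extract, then load
    dag = DAG(
        dag_id="clickstream_daily",          # illustrative name
        start_date=datetime(2017, 1, 1),
        schedule_interval="@daily",
    )

    extract = BashOperator(task_id="extract", bash_command="echo extracting", dag=dag)
    load = BashOperator(task_id="load", bash_command="echo loading", dag=dag)

    extract >> load                          # load runs only after extract succeeds

Explicit dependencies, retries and backfills over this graph are what make such pipelines dependable for non data engineers.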
Lessons?
• Worth it:
  • knowing what you are doing
  • asking questions
  • being a bit suspicious (at least about the results)
  • using R and ggplot2
• Not worth it:
  • trusting charts "blindly"