Facts and Myths: Big Data at an Internet Portal


Evention

Andrzej Litewka, INTERIA.PL

Technology



1. FACTS VERSUS MYTHS: BIG DATA AT AN INTERNET PORTAL. Andrzej Litewka, Grupa Interia

2. 1.16 billion page views (PV) per month • 38 million unique visitors • Over 200 services. Source: Gemius Megapanel, December 2014

4. Year 2011: Interia Kokpit/Target, ClickMapa • More and more data • More and more metrics • Limitations of relational databases

5. What problems do companies encounter when implementing big data? Source: IDG, Computerworld Polska report "Big Data+", September 2014

6. How we approached Big Data • Hardware: 7 servers (2-4 GB RAM; 6-12 TB; 1x CPU, 2-4 cores) and 2 servers (8 GB RAM; 12 TB; 2x CPU, 4 cores) • People: programmer, administrator, analyst • Idea

7. Clickstream • Application logs • Marketing data

8. Conclusions from the pilot • It is powerful :) • Better hardware is advisable • Review of the implementation approach (application management, DevOps) • Deploying and using Big Data should be treated as a process • Look for practical uses in products

9. Oozie, Impala, Pig, Hive, HBase, Spark, YARN, HDFS, Flume, Kafka, Sqoop, HUE, Statsman, Storm, MongoDB, Couchbase, Node.js • 60 servers • Over 500 TB of available space • Cloudera CDH 5

10. DATA: clickstream, logs, marketing, data from services • ~10 GB per hour • 250 GB per day • 7 TB of monthly data growth

11.

12. Daily data stream: ~15-20 million records • >450 million records per month
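
The volume figures on slides 10 and 12 can be cross-checked with simple arithmetic. The sketch below is only an illustration: it assumes a 30-day month and decimal units (1 TB = 1000 GB), neither of which is stated on the slides, and the function names are made up for this example.

```python
# Cross-check of the figures quoted on slides 10 and 12.
# Assumptions (not stated on the slides): a 30-day month, 1 TB = 1000 GB.
HOURS_PER_DAY = 24
DAYS_PER_MONTH = 30


def daily_gb_from_hourly(gb_per_hour: float) -> float:
    """Daily volume in GB implied by an hourly rate."""
    return gb_per_hour * HOURS_PER_DAY


def monthly_tb_from_daily(gb_per_day: float) -> float:
    """Monthly growth in TB implied by a daily volume in GB."""
    return gb_per_day * DAYS_PER_MONTH / 1000


def monthly_records_mln(mln_per_day: float) -> float:
    """Monthly record count (in millions) implied by a daily count."""
    return mln_per_day * DAYS_PER_MONTH


if __name__ == "__main__":
    print(daily_gb_from_hourly(10))    # GB per day from ~10 GB per hour
    print(monthly_tb_from_daily(250))  # TB per month from 250 GB per day
    print(monthly_records_mln(15), monthly_records_mln(20))
```

Under these assumptions, ~10 GB per hour gives 240 GB per day (the slide rounds to 250 GB), 250 GB per day gives 7.5 TB per month (the slide quotes 7 TB), and 15-20 million records per day gives 450-600 million per month, which matches the ">450 million" on slide 12.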

13. Share of mobile traffic: Christmas 2014

14. Share of bots in the number of requests: news services, magazines

15. Distribution of the number of tabs with Interia pages open in a browser window

16.

17. Ask See Ask Develop See Refine x Learn x Data Discovery Traditional BI

18.

19. Real-time analysis

20. Guarana

21. User identification • Assignment to a segment • Scoring tables • CMS • Pragmatic Web module • Page generation
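
The flow on slide 21 (identify the user, assign a segment, consult scoring tables, generate the page) could be sketched roughly as follows. This is a minimal illustration, not Interia's implementation: the cookie name `uid`, the segment names, the scores, and every function name are hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical scoring table: segment -> {content category: score}.
SCORING_TABLE: Dict[str, Dict[str, float]] = {
    "sport_fan": {"sport": 0.9, "news": 0.5, "lifestyle": 0.1},
    "default": {"news": 0.6, "sport": 0.3, "lifestyle": 0.2},
}

# Hypothetical mapping from user ID to segment (made-up sample data).
USER_SEGMENTS: Dict[str, str] = {"user-123": "sport_fan"}


def identify_user(cookies: Dict[str, str]) -> Optional[str]:
    """Step 1: user identification (assumed here to use a 'uid' cookie)."""
    return cookies.get("uid")


def assign_segment(user_id: Optional[str]) -> str:
    """Step 2: segment assignment; unknown users fall back to 'default'."""
    if user_id is None:
        return "default"
    return USER_SEGMENTS.get(user_id, "default")


def rank_categories(segment: str) -> List[str]:
    """Step 3: read the scoring table, highest-scoring category first."""
    scores = SCORING_TABLE[segment]
    return sorted(scores, key=lambda category: scores[category], reverse=True)


def generate_page(cookies: Dict[str, str]) -> List[str]:
    """Steps 4-5: the category order a CMS module might use to build the page."""
    return rank_categories(assign_segment(identify_user(cookies)))
```

In this sketch a known user in the `sport_fan` segment gets sport content first, while an anonymous visitor gets the default ordering. Keeping the scores in a lookup table means the page-generation step does no analysis of its own at request time.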
