Next Generation Big Data Platform at Netflix 2014

These slides accompany Eva Tse's talk on the Netflix big data platform, presented in Las Vegas on November 12, 2014. The slide text covers data pipeline components (Suro, Ursula, Cassandra, Aegisthus, S3), the Tez and MapReduce compilation paths, a distributed SQL query engine, and a list of JIRA tickets across the YARN, Hive, Parquet, and Pig projects. The final slide lists related talks at the same event.

Eva Tse, Netflix
November 12, 2014 | Las Vegas, NV

[Figure: data pipeline diagram. Labels: Cloud apps; Event Data; Suro; Ursula; Cassandra; SSTables; Aegisthus; Dimension Data; 15 min; Daily; S3]

YARN-1864
YARN-2026
YARN-2012
YARN-2214
YARN-2360
YARN-2540

[Figure: compiler and execution engine diagram. Labels: Logical Plan; Physical Plan; TezCompiler; Tez Plan; Tez Execution Engine; MRCompiler; MR Plan; MR Execution Engine]

A Distributed SQL Query Engine for Big Data

YARN-1864
YARN-2026
YARN-2012
YARN-2214
YARN-2360
YARN-2540
HIVE-6783
HIVE-6785
HIVE-6938
HIVE-7800
PARQUET-100
PARQUET-106
PARQUET-2
PARQUET-22
PARQUET-70
PARQUET-75
PARQUET-92
PARQUET-99
PIG-3986
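
The ticket keys above span four projects. As a minimal sketch, the snippet below groups them by project prefix and builds a browse link for each key. The URL template is an assumption (that these tickets live in the public Apache JIRA instance), and the helper names `group_by_project` and `ticket_url` are hypothetical, introduced only for illustration.

```python
from collections import defaultdict

# Ticket keys copied from the slide text above.
TICKETS = [
    "YARN-1864", "YARN-2026", "YARN-2012", "YARN-2214", "YARN-2360", "YARN-2540",
    "HIVE-6783", "HIVE-6785", "HIVE-6938", "HIVE-7800",
    "PARQUET-100", "PARQUET-106", "PARQUET-2", "PARQUET-22",
    "PARQUET-70", "PARQUET-75", "PARQUET-92", "PARQUET-99",
    "PIG-3986",
]

# Assumption: the tickets are tracked in the public Apache JIRA instance.
JIRA_BROWSE_URL = "https://issues.apache.org/jira/browse/{key}"


def group_by_project(keys):
    """Map each project prefix (e.g. "YARN") to its sorted ticket numbers."""
    groups = defaultdict(list)
    for key in keys:
        # Split on the last hyphen: "PARQUET-100" -> ("PARQUET", "100").
        project, _, number = key.rpartition("-")
        groups[project].append(int(number))
    return {project: sorted(numbers) for project, numbers in groups.items()}


def ticket_url(key):
    """Build a browse link for one ticket key (hypothetical helper)."""
    return JIRA_BROWSE_URL.format(key=key)


if __name__ == "__main__":
    for project, numbers in group_by_project(TICKETS).items():
        print(project, len(numbers), numbers)
```

Grouping on the prefix keeps the sketch independent of any tracker API: it only reorganizes the keys that already appear on the slide.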

Talk | Time | Title
PFC-305 | Wednesday, 1:15pm | Embracing Failure: Fault Injection and Service Reliability
BDT-403 | Wednesday, 2:15pm | Next Generation Big Data Platform at Netflix
PFC-306 | Wednesday, 3:30pm | Performance Tuning EC2
DEV-309 | Wednesday, 3:30pm | From Asgard to Zuul, How Netflix’s proven Open Source Tools can accelerate and scale your services
ARC-317 | Wednesday, 4:30pm | Maintaining a Resilient Front-Door at Massive Scale
PFC-304 | Wednesday, 4:30pm | Effective Inter-process Communications in the Cloud: The Pros and Cons of Micro Services Architectures
ENT-209 | Wednesday, 4:30pm | Cloud Migration, Dev-Ops and Distributed Systems
APP-310 | Friday, 9:00am | Scheduling using Apache Mesos in the Cloud

As Netflix expands its service to more countries, devices, and content, it continues to evolve its big data analytics platform to meet the growing demand for product and consumer insights. This year, Netflix overhauled its big data platform: it upgraded to Hadoop 2, transitioned to the Parquet file format, experimented with Pig on Tez for the ETL workload, and adopted Presto as its interactive query engine. In this session, Netflix discusses its latest architecture, how it was built on the Amazon EMR infrastructure, its contributions to the open source community, and some performance numbers for running a big data warehouse on Amazon S3.

Next Generation Big Data Platform at Netflix 2014

1. Eva Tse, Netflix November 12, 2014 | Las Vegas, NV

9. Cloud apps Event Data Suro Ursula Cassandra Aegisthus 15 min Dimension Data Daily S3 SS Tables

10. Storage Compute Service Tools S3

11. Storage Compute Service Tools S3 v2.0

14. • Works well on S3

19. YARN-1864 YARN-2026 YARN-2012 YARN-2214 YARN-2360 YARN-2540

20. S3

21. S3

24. TezCompiler MRCompiler Tez Plan Logical Plan Physical Plan Tez Execution Engine MR Plan MR Execution Engine

27. A Distributed SQL Query Engine for Big Data

28. techblog.netflix.com

31. 21 committed PRs and 14 PRs in review

32. S3

33. v2.0

36. techblog.netflix.com

44. Storage Compute Service Tools S3 v2.0

49. YARN-1864 YARN-2026 YARN-2012 YARN-2214 YARN-2360 YARN-2540 HIVE-6783 HIVE-6785 HIVE-6938 HIVE-7800 PARQUET-100 PARQUET-106 PARQUET-2 PARQUET-22 PARQUET-70 PARQUET-75 PARQUET-92 PARQUET-99 PIG-3986

59. Talk | Time | Title
PFC-305 | Wednesday, 1:15pm | Embracing Failure: Fault Injection and Service Reliability
BDT-403 | Wednesday, 2:15pm | Next Generation Big Data Platform at Netflix
PFC-306 | Wednesday, 3:30pm | Performance Tuning EC2
DEV-309 | Wednesday, 3:30pm | From Asgard to Zuul, How Netflix’s proven Open Source Tools can accelerate and scale your services
ARC-317 | Wednesday, 4:30pm | Maintaining a Resilient Front-Door at Massive Scale
PFC-304 | Wednesday, 4:30pm | Effective Inter-process Communications in the Cloud: The Pros and Cons of Micro Services Architectures
ENT-209 | Wednesday, 4:30pm | Cloud Migration, Dev-Ops and Distributed Systems
APP-310 | Friday, 9:00am | Scheduling using Apache Mesos in the Cloud

Editor's Notes

4 mins
10 mins
15 mins
19 mins
22 mins – huge win for us!
28 mins
30 mins
35 - 36 mins
