The Past, Present and Future of Big Data @LinkedIn


LinkedIn processes user events from across the globe at very large scale. It collects 2.3 trillion messages per day, totaling roughly 2.5 PB of data, and processes them with fault-tolerant batch and stream processing. The data is persisted durably across 120 PB of HDFS storage and made searchable and available to online services. The analytics infrastructure includes data ingestion using Gobblin, dataset management using Dali, storage using HDFS and Voldemort, cluster management using YARN, and workflow orchestration using Azkaban. Solutions such as federated HDFS, Dali, Hadoop OrgQueue, the Elasticity Tuner and Dr. Elephant help scale the system, cluster management and computation across an infrastructure of tens of thousands of nodes.

Technology

SRE
Bruno Connelly
#LinkedInWIT
The Past, Present and Future of Big Data @ LinkedIn

People You May Know
Suja Viswesan
SR ENGINEERING MANAGER, BIG DATA PLATFORM

MEMBERS COMPANIES JOBS SKILLS SCHOOLS KNOWLEDGE

Scale of Processing @ LinkedIn
2.3 trillion messages per day
0.6 PB in, 2.3 PB out per day (compressed)
16 million messages per second at peak
4.6K users
125 TB ingested per day
120 PB of HDFS
224K jobs per day across 13 clusters (9K nodes)
220+ applications
Most applications require stateful processing (~several TBs overall)
800+ nodes across 9 clusters
samza

Big Data!
Collect
- Collect user events from across the globe
- E.g. page views, feed impressions, connections
- Multiple sources of data
- Transport data with low latency
- Scale: 2.3 trillion msgs/day (~2.5 PB) (PYMK scale ~10K msg/sec)
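
As a rough sanity check on the scale figures above, the sketch below derives the average message size and the average sustained byte rate implied by 2.3 trillion messages and roughly 2.5 PB per day. It assumes decimal petabytes (10^15 bytes) and traffic spread evenly over 24 hours; neither assumption comes from the slides, and the function names are illustrative only.

```python
# Back-of-envelope arithmetic on the slide's figures.
# Assumptions (not from the slides): 1 PB = 10**15 bytes, load averaged over 24 h.

MESSAGES_PER_DAY = 2.3e12      # 2.3 trillion msgs/day (from the slide)
BYTES_PER_DAY = 2.5e15         # ~2.5 PB/day (from the slide)
SECONDS_PER_DAY = 24 * 60 * 60


def average_message_bytes(messages: float, total_bytes: float) -> float:
    """Mean size per message, in bytes."""
    return total_bytes / messages


def average_bytes_per_second(total_bytes: float) -> float:
    """Mean sustained throughput over a full day, in bytes per second."""
    return total_bytes / SECONDS_PER_DAY


if __name__ == "__main__":
    size = average_message_bytes(MESSAGES_PER_DAY, BYTES_PER_DAY)
    rate = average_bytes_per_second(BYTES_PER_DAY)
    print(f"average message size: {size:.0f} bytes")
    print(f"average throughput:   {rate / 1e9:.1f} GB/s")
```

Under these assumptions the average message is on the order of 1 KB, and the pipeline sustains tens of gigabytes per second on average.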

Analytics Infrastructure
Data Sources: Espresso, Oracle DB, 3rd Party Services
Data Ingestion: Gobblin
Data Storage: HDFS, Voldemort
Dataset Management: Dali datasets

Analytics Infrastructure
Cluster Management: YARN
Compute Engines
Workflow Orchestration: Azkaban
Usecases: A/B testing, Relevance, Analytics, Reporting

Analytics Infrastructure Challenges
Computation
Cluster Management
System
Scaling up computation
● Limited shared computation resources
● Efficient computation to cut down cost of jobs
Scaling up cluster management
● Thousands of daily active cluster users
● Hundreds of thousands of daily jobs
● A mix of SLA requirements
Scaling up system
● Tens of thousands of nodes
● Tens of PB of data
The Scaling Pyramid

Our Solutions
Scaling up system
● Federated HDFS
● Dali - Logical Data Access Layer for Hadoop
Scaling up cluster management
● Hadoop OrgQueue
● Elasticity Tuner
Scaling up computation
● Dr. Elephant
● Better computation strategy for handling large datasets
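
To make the "Federated HDFS" item above more concrete: in a federated deployment the namespace is split across several independent NameNodes, and clients typically use a mount table to decide which one owns a given path. The slides do not describe how LinkedIn configures this, so the following is only a toy sketch of the routing idea; every mount point and NameNode name in it is hypothetical.

```python
# Toy illustration of a federated namespace: a client-side mount table
# routes each path to the namespace that owns the longest matching prefix.
# All mount points and namenode names below are hypothetical.

from typing import Dict

MOUNT_TABLE: Dict[str, str] = {
    "/data/tracking": "namenode-a",
    "/data/derived": "namenode-b",
    "/user": "namenode-c",
}


def resolve_namespace(path: str, mounts: Dict[str, str] = MOUNT_TABLE) -> str:
    """Return the namespace owning `path` using longest-prefix matching."""
    best = None
    for prefix in mounts:
        # Match whole path components only, so "/username" does not match "/user".
        if path == prefix or path.startswith(prefix + "/"):
            if best is None or len(prefix) > len(best):
                best = prefix
    if best is None:
        raise KeyError(f"no mount point covers {path!r}")
    return mounts[best]


if __name__ == "__main__":
    print(resolve_namespace("/data/tracking/PageViewEvent/2016/10/01"))
    print(resolve_namespace("/user/alice/output"))
```

Because each NameNode holds only the metadata for its own subtree, adding a mount point is one way to grow the namespace without enlarging any single NameNode.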

LinkedIn Open Source Projects
Pinot
Dr. Elephant
Cubert
Streaming
Near Realtime
Stream Processing
Data Management, Performance Tuning, OLAP Storage
Computation Engine, Workflow Manager
samza
Photon - ML

Bruno Connelly
See you at Grace Hopper Celebration!

The Past, Present and Future of Big Data @LinkedIn

1. SRE Bruno Connelly #LinkedInWIT The Past, Present and Future of Big Data @ LinkedIn

2. People You May Know Suja Viswesan SR ENGINEERING MANAGER, BIG DATA PLATFORM

3. MEMBERS COMPANIES JOBS SKILLS SCHOOLS KNOWLEDGE

5. Scale of Processing @ LinkedIn
- 2.3 Trillion Messages per Day
- 0.6 PB in, 2.3 PB out per Day (compressed)
- 16 Million Messages per Second at peaks!
- 4.6K users
- 125 TB ingested per day
- 120 PB of HDFS
- 224K jobs per day across 13 clusters (9K nodes)
- 220+ Applications
- Most Applications require Stateful Processing ~ several TBs (overall)
- 800+ nodes across 9 clusters
- Samza

8. Big Data!

Collect
- Collect User Events from Across the Globe
- E.g. Page Views, Feed Impressions, Connections
- Multiple Sources of Data
- Transport Data with Low Latency
- Scale: 2.3 trillion msgs/day (~2.5 PB) (PYMK scale ~10K msg/sec)

Process
- Highly Reliable and Fault-tolerant Processing of Events
- Offline Batch Processing
- Near-realtime Stream Processing
- Seamlessly Transport Results from Offline Processing to Online Services

Access
- Persist Data Durably
- High Availability for Serving Online Services
- Data should be Searchable
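The Collect figures above invite a quick back-of-the-envelope check. The sketch below is not from the talk: it assumes "~2.5 PB" means decimal petabytes (10^15 bytes) and that the PYMK rate of ~10K msg/sec is sustained all day. The function and variable names are made up for illustration.

```python
# Back-of-the-envelope arithmetic on the "Collect" slide figures.
# Assumptions (not stated in the slides): decimal petabytes, and a
# PYMK message rate that holds steady over a full day.

SECONDS_PER_DAY = 24 * 60 * 60

messages_per_day = 2.3e12    # "2.3 trillion msgs/day"
bytes_per_day = 2.5e15       # "~2.5 PB", taken as 2.5 * 10^15 bytes
pymk_msgs_per_sec = 10_000   # "PYMK scale ~10K msg/sec"


def average_message_bytes(total_bytes: float, total_messages: float) -> float:
    """Mean payload size implied by a daily byte and message count."""
    return total_bytes / total_messages


def daily_messages(rate_per_sec: int) -> int:
    """Messages per day for a constant per-second rate."""
    return rate_per_sec * SECONDS_PER_DAY


avg_size = average_message_bytes(bytes_per_day, messages_per_day)
pymk_daily = daily_messages(pymk_msgs_per_sec)
pymk_share = pymk_daily / messages_per_day

print(f"average message size: {avg_size:.0f} bytes")
print(f"PYMK messages per day: {pymk_daily:,}")
print(f"PYMK share of daily volume: {pymk_share:.4%}")
```

Under these assumptions the average message is roughly 1 KB, and a single use case such as PYMK accounts for well under 0.1% of the daily message volume.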

9. Analytics Infrastructure
- Data Sources: Espresso, Oracle DB, 3rd Party Services
- Data Ingestion: Gobblin
- Data Storage: HDFS, Voldemort
- Dataset Management: Dali Datasets

10. Analytics Infrastructure
- Use cases: A/B testing, Relevance, Analytics, Reporting
- Layers: Cluster Management, Compute Engines, Workflow Orchestration
- Systems named on the slide: YARN, Azkaban

11. Analytics Infrastructure Challenges (The Scaling Pyramid: Computation, Cluster Management, System)

Scaling up computation
● Limited shared computation resources
● Efficient computation to cut down cost of jobs

Scaling up cluster management
● Thousands of daily active cluster users
● Hundreds of thousands of daily jobs
● A mix of SLA requirements

Scaling up system
● Tens of thousands of nodes
● Tens of PB of data

12. Our Solutions

Scaling up system
● Federated HDFS
● Dali - Logical Data Access Layer for Hadoop

Scaling up cluster management
● Hadoop OrgQueue
● Elasticity Tuner

Scaling up computation
● Dr. Elephant
● Better computation strategy for handling large datasets

13. LinkedIn Open Source Projects
- Projects named on the slide: Pinot, Dr. Elephant, Cubert, Samza, Photon-ML
- Category labels on the slide: Streaming, Near Realtime Stream Processing, Data Management, Performance Tuning, OLAP Storage, Computation Engine, Workflow Manager

14. Bruno Connelly See you at Grace Hopper Celebration!
