This document summarizes a presentation about what it takes to run systems at Google-like scale. It discusses the hardware infrastructure of datacenters with over 10,000 machines and customized networking. It then covers the major software stacks at Google, including low-level storage systems like GFS and Colossus, distributed databases like BigTable and Spanner, and batch processing with MapReduce. Similar software stacks at Facebook and differences from high-performance computing are also covered, and current research directions in distributed systems at large scale are briefly outlined.
What does it take to make google work at scale, by xlight
The document discusses the software stacks used by companies like Google and Facebook to run services at massive scale. It describes key components of their infrastructure like MapReduce for distributed processing, BigTable for storage, and Borg for cluster management. The stacks are optimized for data-intensive workloads across thousands of machines through replication, load balancing, and fault tolerance. Current research aims to improve utilization, stream processing, and distributed algorithms to enable real-time insights from huge datasets.
This document provides an overview and deep dive into Robinhood's RDS Data Lake architecture for ingesting data from their RDS databases into an S3 data lake. It discusses their prior daily snapshotting approach, and how they implemented a faster change data capture pipeline using Debezium to capture database changes and ingest them incrementally into a Hudi data lake. It also covers lessons learned around change data capture setup and configuration, initial table bootstrapping, data serialization formats, and scaling the ingestion process. Future work areas discussed include orchestrating thousands of pipelines and improving downstream query performance.
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic..., by Chicago Hadoop Users Group
John Leach, co-founder and CTO of Splice Machine with 15+ years of software development and machine learning experience, will discuss how to use HBase co-processors to build an ANSI-99 SQL database with 1) parallelization of SQL execution plans, 2) ACID transactions with snapshot isolation, and 3) consistent secondary indexing.
Transactions are critical in traditional RDBMSs because they ensure reliable updates across multiple rows and tables. Most operational applications require transactions, but even analytics systems use transactions to reliably update secondary indexes after a record insert or update.
In the Hadoop ecosystem, HBase is a key-value store with real-time updates, but it does not have multi-row, multi-table transactions, secondary indexes, or a robust query language like SQL. Combining SQL with a full transactional model over HBase opens a whole new set of OLTP and OLAP use cases for Hadoop, ones traditionally reserved for RDBMSs like MySQL or Oracle. A transactional HBase system, however, has the advantage of scaling out with commodity servers, leading to a 5x-10x cost savings over traditional databases like MySQL or Oracle.
HBase co-processors, introduced in release 0.92, provide a flexible and high-performance framework to extend HBase. In this talk, we show how we used HBase co-processors to support a full ANSI SQL RDBMS without modifying the core HBase source. We will discuss how endpoint transactions are used to serialize SQL execution plans over to regions so that computation is local to where the data is stored. Additionally, we will show how observer co-processors simultaneously support both transactions and secondary indexing.
The talk will also discuss how Splice Machine extended the work of Google Percolator, Yahoo Labs’ OMID, and the University of Waterloo on distributed snapshot isolation for transactions. Lastly, performance benchmarks will be provided, including full TPC-C and TPC-H results that show how Hadoop/HBase can be a replacement of traditional RDBMS solutions.
To view the accompanying slide deck: http://www.slideshare.net/ChicagoHUG/
Public Terabyte Dataset Project: Web crawling with Amazon Elastic MapReduce, by Hadoop User Group
The document discusses the Public Terabyte Dataset Project which aims to create a large crawl of top US domains for public use on Amazon's cloud. It describes how the project uses various Amazon Web Services like Elastic MapReduce and SimpleDB along with technologies like Hadoop, Cascading, and Tika for web crawling and data processing. Common issues encountered include configuration problems, slow performance from fetching all web pages or using Tika language detection, and generating log files instead of results.
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C..., by Reynold Xin
(Berkeley CS186 guest lecture)
Big Data Analytics Systems: What Goes Around Comes Around
Introduction to MapReduce, GFS, HDFS, Spark, and differences between "Big Data" and database systems.
Stream Computing (The Engineer's Perspective), by Ilya Ganelin
This document discusses stream computing from an engineer's perspective. It begins by contrasting batch and stream processing, noting that stream processing handles data one record at a time with an emphasis on latency over throughput. The document then explores how to achieve scalability, performance, durability and availability in stream processing systems. It notes the tradeoffs between these goals and discusses challenges like handling failures. Specific open-source stream processing systems like Storm, Flink and Apex are then analyzed in terms of how they work, strengths, weaknesses and failure handling. The document concludes by discussing using distributed databases for state management in stream processing applications.
Introducing Apache Giraph for Large Scale Graph Processing, by sscdotopen
This document introduces Apache Giraph, an open source implementation of Google's Pregel framework for large scale graph processing. Giraph allows for distributed graph computation using the bulk synchronous parallel (BSP) model. Key points:
- Giraph uses the vertex-centric programming model where computation is defined in terms of messages passed between vertices.
- It runs on Hadoop and uses its master-slave architecture, with the master coordinating workers that hold vertex partitions.
- PageRank is given as an example algorithm, where each vertex computes its rank based on messages from its neighbors in each superstep until convergence.
- Giraph handles fault tolerance, uses ZooKeeper for coordination, and allows graph algorithms
Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem a..., by Hadoop User Group
The document summarizes Twitter's use of Hadoop, Pig, and HBase for their data pipeline. Key points include:
1) Twitter ingests 7 TB of data daily and runs 20,000 Hadoop jobs across a cluster to process logs, tweets, and other data.
2) They use Pig for ETL processes and to load data into HBase and MySQL for querying and reports.
3) HBase is used for mutable data like user tables where updates need to be resolved, while HDFS is used for immutable data like logs.
This document provides a high-level overview of Impala, an open-source SQL query engine for Apache Hadoop. It describes how Impala addresses limitations of MapReduce by providing faster, more interactive queries using MPP (Massively Parallel Processing). Key points include that Impala runs directly on data files without ETL, uses a distributed query planner and execution engine for high performance, and supports commonly used file formats like Parquet for columnar storage.
Apache Giraph: start analyzing graph relationships in your bigdata in 45 minu...rhatr
Apache Giraph allows users to start analyzing graph relationships in big data within 45 minutes. It is an Apache Hadoop-based framework for graph processing that uses the Bulk Synchronous Parallel (BSP) model. Giraph allows for extracting graph relationships from unstructured data and iterative, exploratory analytics on large graphs distributed across a cluster. It provides a programming model and API for graph processing that leverages Hadoop and HDFS for storage and parallelism.
The document compares and contrasts the SAS and Spark frameworks. It provides an overview of their programming models, with SAS using data steps and procedures while Spark uses Scala and distributed datasets. Examples are shown of common tasks like loading data, sorting, grouping, and regression in both SAS Proc SQL and Spark SQL. Spark MLlib is described as Spark's machine learning library, in contrast to SAS Stats. Finally, Spark Streaming is demonstrated for loading and querying streaming data from Kafka. The key takeaways recommend trying Spark for large data, distributed computing, better control of code, open source licensing, or leveraging Hadoop data.
Steve Watt, Chief Architect, Hadoop and Big Data, Red Hat - 21st BDL meetup, by bigdatalondon
"Hadoop Analytics on your data in place"
Steve Watt leads engineering for the Hadoop and Big Data program at Red Hat. Most recently, Steve has been focusing on Hadoop interoperability and better enabling Hadoop support for alternative filesystems. Prior to Red Hat, Steve spent 2 years at Hewlett-Packard, first co-founding the Hadoop business and then leading engineering as the Hadoop CTO. Prior to HP, Steve was at IBM for 10 years, where he created IBM's first Hadoop distribution and was part of the team that built BigSheets, the first spreadsheet interface for Hadoop.
Building a Scalable Web Crawler with Hadoop by Ahad Rana from CommonCrawl
Ahad Rana, engineer at CommonCrawl, will go over CommonCrawl's extensive use of Hadoop to fulfill their mission of building an open and accessible web-scale crawl. He will discuss their Hadoop data processing pipeline, including their PageRank implementation, describe techniques they use to optimize Hadoop, discuss the design of their URL Metadata service, and conclude with details on how you can leverage the crawl (using Hadoop) today.
HBaseCon 2015: HBase Operations in a Flurry, by HBaseCon
With multiple clusters of 1,000+ nodes replicated across multiple data centers, Flurry has learned many operational lessons over the years. In this talk, you'll explore the challenges of maintaining and scaling Flurry's cluster, how we monitor, and how we diagnose and address potential problems.
This document summarizes Facebook's real-time analytics systems. It describes Data Freeway, which uses a scalable data streaming framework to collect log data with low latency. It also describes Puma, which performs reliable stream aggregation and storage by sharding computations in memory and checkpointing to HBase. Future work may include open sourcing components and adding scheduler support.
Prediction as a service with ensemble model in SparkML and Python ScikitLearn, by Josef A. Habdank
Watch the recording of the talk given at Spark Summit Brussels 2016 here:
https://www.youtube.com/watch?v=wyfTjd9z1sY
Data Science with SparkML on Databricks is a perfect platform for applying ensemble learning at massive scale. This presentation describes a Prediction-as-a-Service platform that can predict trends on 1 billion observed prices daily. In order to train an ensemble model on a multivariate time series in a thousands/millions-dimensional space, one has to fragment the whole space into subspaces which exhibit significant similarity. To achieve this, the vastly sparse space has to undergo dimensionality reduction into a parameter space, which is then used to cluster the observations. The data in the resulting clusters is modeled in parallel using machine learning tools capable of coefficient estimation at massive scale (SparkML and Scikit-Learn). The estimated model coefficients are stored in a database and used to execute predictions on demand via a web service. This approach enables training models fast enough to complete the task within a couple of hours, allowing daily or even real-time updates of the coefficients. The above machine learning framework is used to predict airfares, serving as a support tool for airline Revenue Management systems.
HBase is an open-source, distributed, scalable BigTable-like database built on top of Hadoop that allows for the fast storage and retrieval of large amounts of structured and unstructured data across clusters. It provides a fault-tolerant way of storing large amounts of sparse data and supports real-time read/write access to this data. HBase organizes data into tables divided into rows and columns and is well-suited for applications that benefit from random, real-time read/write access to large datasets.
How Adobe Does 2 Million Records Per Second Using Apache Spark! by Databricks
Adobe’s Unified Profile System is the heart of its Experience Platform. It ingests TBs of data a day and is PBs large. As part of this massive growth we have faced multiple challenges in our Apache Spark deployment which is used from Ingestion to Processing.
Transformation Processing Smackdown; Spark vs Hive vs Pig, by Lester Martin
This document provides an overview and comparison of different data transformation frameworks including Apache Pig, Apache Hive, and Apache Spark. It discusses features such as file formats, source to target mappings, data quality checks, and core processing functionality. The document contains code examples demonstrating how to perform common ETL tasks in each framework using delimited, XML, JSON, and other file formats. It also covers topics like numeric validation, data mapping, and performance. The overall purpose is to help users understand the different options for large-scale data processing in Hadoop.
Adaptive Query Execution: Speeding Up Spark SQL at Runtime, by Databricks
Over the years, there has been extensive and continuous effort on improving Spark SQL's query optimizer and planner in order to generate high-quality query execution plans. One of the biggest improvements is the cost-based optimization framework that collects and leverages a variety of data statistics (e.g., row count, number of distinct values, NULL values, max/min values, etc.) to help Spark make better decisions when picking a query plan.
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M..., by Spark Summit
In this talk, we'll present techniques for visualizing large-scale machine learning systems in Spark. These are techniques that are employed by Netflix to understand and refine the machine learning models behind Netflix's famous recommender systems, which are used to personalize the Netflix experience for their 99 million members around the world. Essential to these techniques is Vegas, a new OSS Scala library that aims to be the "missing MatPlotLib" for Spark/Scala. We'll talk about the design of Vegas and its usage in Scala notebooks to visualize machine learning models.
This document discusses best practices for scaling Hadoop applications. It begins with examples of the importance of scaling applications as user bases and data sizes grow rapidly. It then explains concepts like linear scalability, reducing latency and increasing throughput. Key factors that can limit scalability are identified as sequential bottlenecks, load imbalance, over-partitioning, critical paths, algorithmic overheads, synchronization issues, and task granularity. The document concludes with best practices for Hadoop like using higher-level languages, tuning file and buffer sizes, using compression, and minimizing network requests.
Datacenter Computing with Apache Mesos - BigData DC, by Paco Nathan
The document discusses datacenter computing using Apache Mesos. It begins by discussing concepts like "data democratization" and "cluster democratization", which refer to making data and computing resources available throughout an organization. It then discusses lessons from Google's approach to datacenter computing, and frameworks that can be integrated with Mesos like Hadoop, Spark, and Docker. Examples of companies using Mesos in production are provided, including Twitter, Airbnb, and eBay. Mesos provides a common substrate that makes heterogeneous computing resources available as a homogeneous set, improving scalability, elasticity, fault tolerance and resource utilization.
This document provides an overview of MapReduce, a programming model developed by Google for processing and generating large datasets in a distributed computing environment. It describes how MapReduce abstracts away the complexities of parallelization, fault tolerance, and load balancing to allow developers to focus on the problem logic. Examples are given showing how MapReduce can be used for tasks like word counting in documents and joining datasets. Implementation details and usage statistics from Google demonstrate how MapReduce has scaled to process exabytes of data across thousands of machines.
Enterprise Scale Topological Data Analysis Using Spark, by Alpine Data
This document discusses scaling topological data analysis (TDA) using the Mapper algorithm to analyze large datasets. It describes how the authors built the first open-source scalable implementation of Mapper called Betti Mapper using Spark. Betti Mapper uses locality-sensitive hashing to bin data points and compute topological summaries on prototype points to achieve an 8-11x performance improvement over a naive Spark implementation. The key aspects of Betti Mapper that enable scaling to enterprise datasets are locality-sensitive hashing for sampling and using prototype points to reduce the distance matrix computation.
HBaseCon2017 Community-Driven Graphs with JanusGraph, by HBaseCon
Graphs are well-suited for many use cases to express and process complex relationships among entities in enterprise and social contexts. Fueled by the growing interest in graphs, there are various graph databases and processing systems that dot the graph landscape. JanusGraph is a community-driven project that continues the legacy of Titan, a pioneer of open source graph databases. JanusGraph is a scalable graph database optimized for large-scale transactional and analytical graph processing. In the session, we will introduce JanusGraph, which features full integration with the Apache TinkerPop graph stack. We will discuss JanusGraph's optimized storage model, which relies on HBase for fast graph traversal and processing.
by Jason Plurad and Jing Chen He of IBM
140614 bigdatacamp-la-keynote-jon hsieh, by Data Con LA
The document discusses the evolution of big data stacks from their origins inspired by Google's systems through imitation via Hadoop-based stacks to ongoing innovation. It traces the development of major components like MapReduce, HDFS, HBase and their adoption beyond Google. It also outlines the timeline of open source projects and companies in this space from 2003 to the present.
This document discusses how big data technologies can help operations teams. It provides an overview of the challenges operations teams face in collecting and analyzing large amounts of distributed log and metric data. It then describes Facebook's use of Hadoop, Hive, and Scribe for distributed storage, processing, and logging. Key parts of these systems are explained, along with improvements made to optimize performance and reliability. The conclusion advocates making big data easy to use, logging more data, building debugging tools, and doing both real-time and historical analysis to help operations with big data.
This document summarizes Facebook's Data Freeway and Puma systems for real-time analytics. Data Freeway is a scalable data streaming framework that includes components like Scribe, Calligraphus, and Continuous Copier to reliably stream data at high throughput with low latency. Puma is Facebook's real-time aggregation engine that uses Data Freeway to perform aggregations on streaming data as it arrives and stores results in HBase, enabling real-time queries with latency of only seconds. It evolved through the Puma2 and Puma3 versions, which added support for more complex aggregations and queries.
Realtime statistics using Java, Kafka and Graphite, by Hung Nguyen
This document discusses using Apache Kafka, a message broker, to create real-time statistics. It describes using Kafka to collect monitoring data from applications through a Scribe proxy and send it to Graphite for storage and visualization. The demo application shows injecting code into an existing platform to collect application metrics and send them to Kafka for processing, aggregation and output to Graphite.
Facebook uses a combination of PHP, MySQL, and Memcache (LAMP stack) for their web and application tier. They have also developed various services and tools like Thrift, Scribe, and ODS to handle tasks like logging, monitoring, and communication between systems. Their architecture is designed for scale using principles like simplicity, optimizing for performance, and distributing load. Key components include caching data in Memcache, distributing MySQL databases, and developing services in higher performing languages when needed beyond the capabilities of PHP.
Data Freeway is a system developed by Facebook to handle large volumes of data in real-time at scale. It includes components like Scribe for distributed logging, Calligraphus for persisting logs to HDFS, and Puma for real-time analytics on the data. The system is designed to handle over 10GB/second of data reliably with low latency of less than 10 seconds for 99% of data. It provides a simple interface for applications to access real-time data streams through tools like ptail. The system is open source and used at Facebook to power applications like real-time search, spam detection, and metrics analysis.
Fluentd is an open source data collector that allows flexible data collection, processing, and storage. It collects log data from various sources using input plugins and sends the data to various outputs like files, databases or forward to other Fluentd servers. It uses a pluggable architecture so new input/output plugins can be added through Ruby gems. It provides features like buffering, retries and reliability.
Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data..., by DataStax Academy
Typesafe did a survey of Spark usage last year and found that a large percentage of Spark users combine it with Cassandra and Kafka. This talk focuses on streaming data scenarios that demonstrate how these three tools complement each other for building robust, scalable, and flexible data applications. Cassandra provides resilient and scalable storage, with flexible data format and query options. Kafka provides durable, scalable collection of streaming data with message-queue semantics. Spark provides very flexible analytics, everything from classic SQL queries to machine learning and graph algorithms, running in a streaming model based on "mini-batches", offline batch jobs, or interactive queries. We'll consider best practices and areas where improvements are needed.
GA4GH meeting at the Sanger Institute, by Matt Massie
ADAM is a fast, scalable genome analysis platform built using Apache Spark and data formats like Avro and Parquet. It provides tools for read processing, variant calling, and multi-sample analysis on whole genome, high coverage data. The platform is designed to be easy to use for developers while leveraging existing open-source systems and deploying on both local and cloud infrastructures.
Big Data on EC2: Mashing Technology in the Cloud, by George Ang
This document discusses how a startup serving widgets on popular online publications scaled their infrastructure using Amazon Web Services to handle spikes in traffic from over 1 billion users sharing over 10 billion URLs. They used a hub-and-spoke architecture with components like Cascading, Amazon Elastic MapReduce, and AsterData to analyze user sharing patterns in a cost-effective and horizontally scalable way.
Cassandra - A decentralized storage system, by Arunit Gupta
Cassandra uses consistent hashing to partition and distribute data across nodes in the cluster. Each node is assigned a random position on a ring based on the hash value of the partition key. This allows data to be evenly distributed when nodes join or leave. Cassandra replicates data across multiple nodes for fault tolerance and high availability. It supports different replication policies like rack-aware and datacenter-aware replication to ensure replicas are not co-located. Membership and failure detection in Cassandra uses a gossip protocol and scuttlebutt reconciliation to efficiently discover nodes and detect failures in the distributed system.
The document provides an overview of the Google Cloud Platform (GCP) Data Engineer certification exam, including the content breakdown and question format. It then details several big data technologies in the GCP ecosystem such as Apache Pig, Hive, Spark, and Beam. Finally, it covers various GCP storage options including Cloud Storage, Cloud SQL, Datastore, BigTable, and BigQuery, outlining their key features, performance characteristics, data models, and use cases.
Kudu is a storage engine for Hadoop designed to address gaps in Hadoop's ability to handle workloads that require both high-throughput data ingestion and low-latency random access. It is a columnar storage engine that uses a log-structured merge tree to store data and provides APIs for NoSQL and SQL access. Kudu aims to provide high performance for both scans and random access through its columnar design and tablet architecture that partitions data across servers.
Big Data Streams Architectures. Why? What? How? by Anton Nazaruk
With the current zoo of technologies and the different ways they interact, it is a big challenge to architect a system (or adapt an existing one) that conforms to low-latency BigData analysis requirements. Apache Kafka, and the Kappa Architecture in particular, attract more and more attention over the classic Hadoop-centric technology stack. The new Consumer API gives a significant boost in this direction. Microservices-based stream processing and the new Kafka Streams tend to be a synergy in the BigData world.
The document discusses the SMACK stack 1.1, which includes tools for streaming, Mesos, analytics, Cassandra, and Kafka. It describes how SMACK stack 1.1 adds capabilities for dynamic compute, microservices, orchestration, and microsegmentation. It also provides examples of running Storm on Mesos and using Apache Kafka for decoupling data pipelines.
The document discusses best practices for moving Cassandra pilots and proofs of concept (PoCs) to production. It recommends starting with defining queries, building out 5-8 pilots, and designing REST APIs first. When moving to production, considerations include infrastructure, testing, coding practices like using asynchronous execution, data modeling for queries and analytics, and application optimization techniques.
Introduction To Spark - Durham LUG 20150916, by Ian Pointer
Apache Spark is a fast and general engine for large-scale data processing. It is a framework for massive parallel computing that harnesses the power of cheap memory. Spark provides high performance through its in-memory computing engine and supports Scala, Java, Python and R. It includes modules for streaming, machine learning, graph processing and SQL. Resilient Distributed Datasets (RDDs) are Spark's basic abstraction, which are immutable and fault-tolerant collections that can be operated on in parallel.
The document discusses Apache Beam, a solution for next generation data processing. It provides a unified programming model for both batch and streaming data processing. Beam allows data pipelines to be written once and run on multiple execution engines. The presentation covers common challenges with historical data processing approaches, how Beam addresses these issues, a demo of running a Beam pipeline on different engines, and how to get involved with the Apache Beam community.
Introduction to apache_cassandra_for_developers-lhg, by zznate
The document provides an introduction to Apache Cassandra for Java developers, explaining key concepts such as data storage, compaction, consistency levels, and how Cassandra differs from relational databases; it also demonstrates examples of performing common operations like reading, writing, and deleting data using the Hector client library.
KSQL is an open-source streaming SQL engine for Apache Kafka. It allows users to easily interact with and analyze streaming data in Kafka using SQL-like queries. KSQL builds upon Kafka Streams to provide stream processing capabilities with exactly-once processing semantics. It aims to expand access to stream processing beyond coding by providing an interactive SQL interface for tasks like streaming ETL, anomaly detection, real-time monitoring, and simple topic transformations. KSQL can be run in standalone, client-server, or application deployment modes.
Introduction to Apache Cassandra for Java Developers (JavaOne), by zznate
The database industry has been abuzz over the past year about NoSQL databases. Apache Cassandra, which has quickly emerged as a best-of-breed solution in this space, is used at many companies to achieve unprecedented scale while maintaining streamlined operations.
This presentation goes beyond the hype, buzzwords, and rehashed slides and actually presents the attendees with a hands-on, step-by-step tutorial on how to write a Java application on top of Apache Cassandra. It focuses on concepts such as idempotence, tunable consistency, and shared-nothing clusters to help attendees get started with Apache Cassandra quickly while avoiding common pitfalls.
This is an exam cheat sheet that aims to cover all the key points for the GCP Data Engineer certification exam.
Let me know if there are any mistakes and I will try to update it.
The document discusses different cloud data architectures including streaming processing, Lambda architecture, Kappa architecture, and patterns for implementing Lambda architecture on AWS. It provides an overview of each architecture's components and limitations. The key differences between Lambda and Kappa architectures are outlined, with Kappa being based solely on streaming and using a single technology stack. Finally, various AWS services that can be used to implement Lambda architecture patterns are listed.
The document discusses Apache Cassandra, a distributed database management system designed to handle large amounts of data across many commodity servers. It was developed at Facebook and modeled after Google's Bigtable. The summary discusses key concepts like its use of consistent hashing to distribute data, support for tunable consistency levels, and focus on scalability and availability over traditional SQL features. It also provides an overview of how Cassandra differs from relational databases by not supporting joins, having an optional schema, and using a prematerialized and transaction-less model.
GPS Insight on Using Presto with Scylla for Data Analytics and Data Archival, by ScyllaDB
GPS Insight is a leader in fleet vehicle management using IoT. Internally they use a combination of SQL and NoSQL big data technologies, including distributed SQL data analytics via Presto, an open-source query engine developed by Facebook. Learn how to set up, configure, and use Presto with Scylla for supporting ad hoc non-partition key queries for analytics and data scientists. Plus hear how to use Presto for a Data Archival approach with csv files on S3 or similar storage appliance.
Loading Data into Redshift: Data Analytics Week at the SF Loft, by Amazon Web Services
Loading Data into Redshift: Data Analytics Week at the San Francisco Loft
How do you get data from your sources into your Redshift data warehouse? We'll show how to use AWS Glue and Amazon Kinesis Firehose to make it easy to automate the work to get data loaded.
Level: Intermediate
Speakers:
Aser Moustafa - Data Warehouse Specialist Solutions Architect, AWS
Vikram Gangulavoipalyam - Enterprise Solutions Architect, AWS
Similar to What does it take to make google work at scale (20)
6. What we'll chat about
1. Datacenter hardware
2. Scalable software stacks
   a. Google
   b. Facebook
3. Differences compared to HPC
4. Current research directions
Please ask questions - any time! :)
12. How's this different to HPC?
• Emphasis on commodity hardware
  – No expensive interconnect
  – Mid-range machines
  – Energy/performance/cost trade-off essential
• Massive automation
  – Very small number of on-site staff
  – Automated software bootstrap
• Fault-tolerant design
  – Each component can fail
  – Software must be aware and compensate
13. The software side
[Diagram: three machines, each running a customized Linux kernel with containers on top; "transparent distributed systems" span all of them. "We'll look at what goes here!"]
16. GFS/Colossus
• Bulk data block storage system
• Optimized for large files (GB-size)
  – Supports small files, but not the common case
• Read, write, append modes
• Colossus = GFSv2, adds some improvements
  – e.g., Reed-Solomon-based erasure coding (worked example below)
  – Better support for latency-sensitive applications
  – Sharded meta-data layer, rather than a single master
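The erasure-coding point deserves a quick worked example. The sketch below compares the raw-storage overhead of 3-way replication against a Reed-Solomon code; the RS(6, 3) parameters are an illustrative assumption, not a documented Colossus configuration.

```python
# Back-of-the-envelope storage cost per logical byte (illustrative).

def replication_overhead(copies: int) -> float:
    """n-way replication stores n full copies of every block."""
    return float(copies)

def reed_solomon_overhead(data_blocks: int, parity_blocks: int) -> float:
    """RS(d, p) stores d data + p parity blocks per d logical blocks."""
    return (data_blocks + parity_blocks) / data_blocks

print(replication_overhead(3))      # 3.0x raw storage, tolerates 2 lost copies
print(reed_solomon_overhead(6, 3))  # 1.5x raw storage, tolerates any 3 lost blocks
```

Halving the raw bytes stored while tolerating more failures is why erasure coding pays off for bulk data; the price is a more expensive reconstruction on reads after a failure.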
19. The Google Stack
Figure from M. Schwarzkopf, "Operating system support for warehouse-scale computing", PhD thesis, University of Cambridge, 2015 (to appear).
Details & Bibliography: http://malteschwarzkopf.de/research/assets/google-stack.pdf
21. BigTable (2006)
• Distributed tablets (~1 GB) hold subsets of the map
• On top of Colossus, which handles replication and fault tolerance: only one (active) server per tablet!
• Reads & writes within a row are transactional (see the sketch below)
  – Independently of the number of columns touched
  – But: no cross-row transactions possible
  – Turns out users find this hard to deal with
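A minimal sketch of that contract, using a toy in-memory tablet (hypothetical names, not the real BigTable API): every column update to a single row commits atomically, and there is deliberately no primitive that spans two rows.

```python
import threading

class ToyTablet:
    """Toy model of BigTable's consistency contract: per-row atomicity,
    no cross-row transactions. Illustrative only."""

    def __init__(self):
        self._rows = {}         # row key -> {column: value}
        self._locks = {}        # row key -> per-row lock
        self._meta = threading.Lock()

    def _lock_for(self, row_key):
        with self._meta:
            return self._locks.setdefault(row_key, threading.Lock())

    def mutate_row(self, row_key, updates):
        """Apply all column updates to one row as a single atomic step."""
        with self._lock_for(row_key):
            self._rows.setdefault(row_key, {}).update(updates)

    def read_row(self, row_key):
        """Read all columns of one row atomically."""
        with self._lock_for(row_key):
            return dict(self._rows.get(row_key, {}))

t = ToyTablet()
t.mutate_row("user:42", {"profile:name": "Ada", "profile:city": "NYC"})
# Note: there is no mutate_rows(["user:42", "user:43"], ...); updating
# two rows together is exactly what users found hard to live without.
```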
22. BigTable (2006)
• META0 tablet is the "root" for name resolution (see the lookup sketch below)
  – Like a coordinator/leader
  – Nested META1 tablets improve meta-data lookup
• Chubby is used to elect the master (the META0 tablet server) and to maintain the list of tablet servers & schemas
  – 5-way replicated Paxos consensus on data in Chubby
• Built atop GFS; building block for MegaStore
  – "Next gen" with transactions: Spanner
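A toy version of the resulting two-hop location lookup (made-up tablet names, for illustration only; the real system caches these hops aggressively, and Chubby stores where the root tablet lives):

```python
import bisect

def lookup(index, key):
    """index: sorted list of (start_key, target); return the target
    whose key range contains `key`."""
    starts = [s for s, _ in index]
    return index[bisect.bisect_right(starts, key) - 1][1]

# Hypothetical tablet names.
META0 = [("", "meta1-a"), ("m", "meta1-b")]                  # root tablet
META1 = {"meta1-a": [("", "tablet-1"), ("g", "tablet-2")],
         "meta1-b": [("m", "tablet-3"), ("t", "tablet-4")]}

def locate_tablet(row_key):
    meta1 = lookup(META0, row_key)        # hop 1: root (META0) tablet
    return lookup(META1[meta1], row_key)  # hop 2: META1 -> user tablet

print(locate_tablet("hbase"))   # tablet-2
print(locate_tablet("zebra"))   # tablet-4
```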
24. Spanner (2012)
• BigTable insufficient for some consistency needs
• Often have transactions across >1 data centres
  – May buy an app on the Play Store while travelling in the U.S.
  – Hit a U.S. server, but customer billing data is in the U.K.
  – Or may need to update several replicas for fault tolerance
• Wide-area consistency is hard
  – due to long delays and clock skew
  – no global, universal notion of time
  – NTP not accurate enough, PTP doesn't work (jittery links)
25. Spanner (2012)
• Spanner offers transactional consistency: full RDBMS power, ACID properties, at global scale!
• Secret sauce: hardware-assisted clock sync
  – Using GPS and atomic clocks in data centres
• Use global timestamps and Paxos to reach consensus
  – Still have a period of uncertainty for write TX: wait it out! (sketched below)
  – Each timestamp is an interval [tt.earliest, tt.latest]: tt.earliest is definitely in the past, tt.latest is definitely in the future, and the true absolute time t_abs lies in between
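The interval API is what makes "wait it out" implementable. Below is a toy sketch of Spanner-style commit wait; the fixed 4 ms uncertainty is an assumption for illustration, whereas the real bound is derived from the GPS/atomic-clock time masters.

```python
import time
from dataclasses import dataclass

@dataclass
class TTInterval:
    earliest: float   # definitely in the past
    latest: float     # definitely in the future

def tt_now(epsilon: float = 0.004) -> TTInterval:
    """Toy TrueTime: the local clock +/- an assumed uncertainty bound."""
    t = time.time()
    return TTInterval(t - epsilon, t + epsilon)

def commit(write: dict) -> None:
    """Commit wait, sketched: pick timestamp s = tt_now().latest, then
    wait until s is definitely in the past before making the write
    visible, so no later transaction can observe a smaller timestamp."""
    s = tt_now().latest
    while tt_now().earliest < s:     # uncertainty interval still covers s
        time.sleep(0.0005)
    write["visible_at"] = s          # externally consistent from here on
```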
26. The Google Stack
Figure from M. Schwarzkopf, "Operating system support for warehouse-scale computing", PhD thesis, University of Cambridge, 2015 (to appear).
Details & Bibliography: http://malteschwarzkopf.de/research/assets/google-stack.pdf
27. MapReduce (2004)
• Parallel programming framework for scale
  – Run a program on 100s to 10,000s of machines
• Framework takes care of:
  – Parallelization, distribution, load-balancing, scaling up (or down) & fault-tolerance
• Accessible: the programmer provides just two methods (sketched below) ;-)
  – map(key, value) → list of <key', value'> pairs
  – reduce(key', list of value') → result
  – Inspired by functional programming
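To make the two-method contract concrete, here is a self-contained word-count sketch with an in-process stand-in for the shuffle; in the real system, map tasks run on the machines holding each input split and intermediate data moves over the network.

```python
from collections import defaultdict
from itertools import chain

def map_fn(key, value):
    """map(key, value) -> list of (key', value') pairs.
    Here key is a document name (unused) and value is its text."""
    return [(word, 1) for word in value.split()]

def reduce_fn(key, values):
    """reduce(key', list of value') -> result. Here: sum the counts."""
    return sum(values)

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map phase: emit intermediate pairs for every input record.
    intermediate = chain.from_iterable(map_fn(k, v) for k, v in inputs)
    # Shuffle: group intermediate values by key (the network-heavy step).
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)
    # Reduce phase: one reduce call per intermediate key.
    return {k: reduce_fn(k, vs) for k, vs in groups.items()}

print(run_mapreduce([("d1", "x x y"), ("d2", "x y y")], map_fn, reduce_fn))
# {'x': 3, 'y': 3}
```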
28. MapReduce
[Diagram: Input → Map → Shuffle → Reduce → Output. Map tasks emit intermediate pairs X: 5, X: 3, Y: 1, Y: 7; the shuffle groups them by key; Reduce sums each group, giving the results X: 8, Y: 8.]
• Perform Map() against local data matching the input specification
• Aggregate the gathered results for each intermediate key using Reduce()
• End users can query the results via a distributed key/value store
Slide originally due to S. Hand's distributed systems lecture course at Cambridge:
http://www.cl.cam.ac.uk/teaching/1112/ConcDisSys/DistributedSystems-1B-H4.pdf
29. MapReduce: Pros & Cons
• Extremely simple, and:
  – Can auto-parallelize (since operations on every element in the input are independent)
  – Can auto-distribute (since it relies on the underlying Colossus/BigTable distributed storage)
  – Gets fault-tolerance (since tasks are idempotent, i.e. can just be re-executed if a machine crashes)
• Doesn't really use any sophisticated distributed systems algorithms (except storage replication)
• However, not a panacea:
  – Limited to batch jobs, and to computations expressible as a map() followed by a reduce()
30. The Google Stack
Figure from M. Schwarzkopf, "Operating system support for warehouse-scale computing", PhD thesis, University of Cambridge, 2015 (to appear).
Details & Bibliography: http://malteschwarzkopf.de/research/assets/google-stack.pdf
32. Dremel (2010)
• Stores protocol buffers
  – Google's universal serialization format
  – Nested messages → nested columns
  – Repeated fields → repeated records
• Efficient encoding
  – Many sparse records: don't store NULL fields
• Record re-assembly
  – Need to put results back together into records
  – Uses a Finite State Machine (FSM) defined by the protocol buffer structure (toy striping example below)
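A toy illustration of the striping idea, with Python dicts standing in for protocol buffers: every leaf field becomes a column named by its path, NULLs are simply absent, and repeated fields contribute multiple values to the same column. (The real format also stores repetition and definition levels, which is what the re-assembly FSM consumes; that bookkeeping is omitted here.)

```python
def columnize(record, prefix=""):
    """Yield (column-path, value) pairs for a nested record."""
    for field, value in record.items():
        path = f"{prefix}{field}"
        if isinstance(value, dict):        # nested message -> nested column
            yield from columnize(value, path + ".")
        elif isinstance(value, list):      # repeated field -> repeated values
            for item in value:
                if isinstance(item, dict):
                    yield from columnize(item, path + ".")
                else:
                    yield (path, item)
        elif value is not None:            # sparse: don't store NULL fields
            yield (path, value)

doc = {"docid": 10,
       "name": [{"lang": "en"}, {"lang": "fr"}],
       "links": {"fwd": [20, 40]}}
print(list(columnize(doc)))
# [('docid', 10), ('name.lang', 'en'), ('name.lang', 'fr'),
#  ('links.fwd', 20), ('links.fwd', 40)]
```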
33. The Google Stack
Figure from M. Schwarzkopf, "Operating system support for warehouse-scale computing", PhD thesis, University of Cambridge, 2015 (to appear).
Details & Bibliography: http://malteschwarzkopf.de/research/assets/google-stack.pdf
36. Borg/Omega: workloads
[Figure: per-cluster workload comparison. Jobs/tasks are shown as counts; CPU/RAM as resource-seconds (i.e., resource × job runtime in seconds).
  – Cluster A: medium size, medium utilization
  – Cluster B: large size, medium utilization
  – Cluster C: medium size (12k machines), high utilization; this is the public trace]
Figure from M. Schwarzkopf et al., "Omega: flexible, scalable schedulers for large compute clusters", Proceedings of EuroSys 2013.
39. Borg/Omega: workloads
Batch jobs have a longer-tailed CPI (cycles-per-instruction) distribution, a consequence of their lower scheduling priority in the kernel scheduler.
Figures from M. Schwarzkopf, "Operating system support for warehouse-scale computing", PhD thesis, University of Cambridge, 2015 (to appear).
41. How's this different to HPC?
• Generally avoid tight synchronization
  – No barrier syncs!
  – Asynchronous replication, eventual consistency
• Much more data-intensive
  – Lower compute-to-data ratio
• Aggressive resource sharing for utilization
42. The facebook Stack
Figure from M. Schwarzkopf, "Operating system support for warehouse-scale computing", PhD thesis, University of Cambridge, 2015 (to appear).
Details & Bibliography: http://malteschwarzkopf.de/research/assets/facebook-stack.pdf
45. Haystack & f4
• Blob stores; hold photos and videos
  – Not: status updates, messages, like counts
• Items have a level of hotness
  – How many users are currently accessing this?
• Baseline "cold" storage: MySQL
• Want to cache close to users (see the sketch below)
  – Reduces network traffic
  – Reduces latency
  – But cache capacity is limited!
• Replicate for performance, not resilience
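A sketch of the hotness idea (the window and threshold are invented illustration values, not Facebook's actual policy): track recent accesses per blob and let placement follow popularity, so only genuinely hot items occupy the scarce cache tier.

```python
import time
from collections import defaultdict

class HotnessTracker:
    """Toy hotness-based placement in the spirit of Haystack/f4:
    recently popular blobs go to the replicated 'hot' tier near users;
    everything else stays in bulk storage."""

    def __init__(self, window_s=3600, hot_threshold=100):
        self.window_s = window_s              # how far back "recent" reaches
        self.hot_threshold = hot_threshold    # accesses needed to count as hot
        self.accesses = defaultdict(list)     # blob id -> access timestamps

    def record_access(self, blob_id):
        self.accesses[blob_id].append(time.time())

    def tier(self, blob_id):
        now = time.time()
        recent = [t for t in self.accesses[blob_id] if now - t < self.window_s]
        self.accesses[blob_id] = recent       # prune stale entries
        return "hot" if len(recent) >= self.hot_threshold else "warm/cold"
```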
47. How about other companies?
Very similar stacks: Microsoft, Yahoo and Twitter are all similar in principle.
Typical set-up:
• Front-end serving systems and fast back-ends
• Batch data processing systems
• Multi-tier structured/unstructured storage hierarchy
• Coordination system and cluster scheduler
Minor differences owing to business focus
• e.g., Amazon focuses on inventory/shopping cart
48. Open source software
Lots of open reimplementations!
• MapReduce → Hadoop, Spark, Metis
• GFS → HDFS
• BigTable → HBase, Cassandra
• Borg/Omega → Mesos, Firmament
But also some releases from the companies themselves…
• Presto (Facebook)
• Kubernetes (Google)
49. Current research
Many hot topics! This distributed infrastructure is still new.
• Make many-core machines go further
  – Increase utilization, schedule better
  – Deal with co-location interference
• Rapid insights on huge datasets
  – Fast, incremental stream processing, e.g. Naiad
  – Approximate analytics, e.g. BlinkDB
• Distributed algorithms
  – New consensus systems, e.g. Raft
50. Conclusions
1. Running at huge (10k+ machine) scale requires different software stacks.
2. Pretty interesting systems
   a. Do read the papers!
   b. (Partial) bibliography:
      http://malteschwarzkopf.de/research/assets/google-stack.pdf
      http://malteschwarzkopf.de/research/assets/facebook-stack.pdf
3. Lots of activity in this area!
   a. Open source software
   b. Research initiatives