I both love and hate Hadoop. I love it because it provides and commodifies an easy-to-use, engineer-friendly, and scalable abstraction layer over a cluster of machines. I hate it because of all the gotchas and the vast knowledge required to be productive across the full Hadoop stack. In this talk I will focus on, and share, the knowledge necessary to be a productive data engineer.
The Evolution of Hadoop at Spotify - Through Failures and Pain - Rafał Wojdyła
The quickest way to learn and evolve infrastructure is by encountering obstacles and being forced to overcome limitations that keep you inches away from project goals. At Spotify, we’ve encountered many of these obstacles and frustrations as we grew our Hadoop cluster from a few machines in an office closet aggregating played song events for financial reports, to our current 900 node cluster that plays a large role in many features that you see in our application today.
Two members of Spotify’s Hadoop ‘squad’ will weave in war stories, failures, frustrations and lessons learned to describe the Hadoop/Big Data architecture at Spotify and talk about how that architecture has evolved.
We’ll talk about how and why we use a number of tools, including Apache Falcon and Apache Bigtop to test changes; Apache Crunch, Scalding and Hive w/ Tez to build features and provide analytics; and Snakebite and Luigi, two in-house tools created to overcome common frustrations.
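Since Snakebite (a pure-Python HDFS client) and Luigi (a workflow engine) are mentioned above, here is a small, hypothetical sketch of how the two are commonly combined. The NameNode address, HDFS paths and aggregation logic are assumptions for illustration, not Spotify's actual pipeline code.

```python
import luigi
from snakebite.client import Client


class CountPlays(luigi.Task):
    """Count played-song log lines for one day of events stored on HDFS."""
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget("plays-%s.txt" % self.date)

    def run(self):
        hdfs = Client("namenode.example.com", 8020)   # assumed NameNode address
        day_dir = "/logs/plays/%s" % self.date        # assumed directory layout
        total = 0
        for entry in hdfs.ls([day_dir]):
            # text() streams (and decompresses) each file's contents
            for chunk in hdfs.text([entry["path"]]):
                total += chunk.count("\n")
        with self.output().open("w") as out:
            out.write("%d\n" % total)


if __name__ == "__main__":
    luigi.run()
```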
How Apache Drives Music Recommendations At Spotify - Josh Baer
The slides go through the high-level process of generating personalized playlists for all Spotify's users, using Apache big data products extensively.
Presentation given at Apache: Big Data Europe conference on September 29th, 2015 in Budapest.
Discover how the world of big data is evolving and becoming faster, more reliable and better organized-- powering many of the cooler new features that you see in the client today!
The document summarizes lessons learned by Spotify about scaling infrastructure and operations. Some key points include: starting with letting experts handle data centers when small, streamlining procurement processes, treating capacity in standardized "pods", focusing infrastructure teams on platforms rather than individual services, implementing automated processes for configuration, provisioning and monitoring, and having individual product teams take on operational responsibilities for their own services with guidance from infrastructure teams. The presentation also covers specific scaling challenges faced with storage, networking, and resilience strategies like retry policies and load shedding.
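As an illustration of one of the resilience strategies named above, here is a generic retry-with-exponential-backoff-and-jitter sketch; the wrapped call and the limits are assumptions, not Spotify's actual policy.

```python
import random
import time


def call_with_retries(fn, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Call fn(); on failure sleep up to base_delay * 2**attempt, then retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))  # full jitter spreads out retry storms


# Example: a flaky call that succeeds on the third attempt.
state = {"calls": 0}

def flaky():
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

print(call_with_retries(flaky))  # -> "ok" after two retried failures
```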
1) At Spotify, big data is used to answer important questions from various stakeholders like how many times songs have been streamed, most popular artists, and streaming numbers for marketing purposes.
2) Data infrastructure at Spotify includes a large Hadoop cluster with over 6 petabytes of data used to generate insights from user activity logs and improve the product.
3) Answering tricky questions requires techniques like A/B testing and analyzing streaming patterns to determine viral songs or artist reactions to new releases. Data-driven decisions are made to personalize the user experience.
This document summarizes Neville Li's work at Spotify developing real-time data streaming applications using Storm. It describes Spotify's large data volumes, how Storm is used to process streaming data at Spotify, details of a social listening topology, and lessons learned around development processes, language choices, and deployment.
Apache Spark: killer or savior of Apache Hadoop? - rhatr
The Big Boss(tm) has just OKed the first Hadoop cluster in the company. You are the guy in charge of analyzing petabytes of your company's valuable data using a combination of custom MapReduce jobs and SQL-on-Hadoop solutions. All of a sudden the web is full of articles telling you that Hadoop is dead, Spark has won and you should quit while you're still ahead. But should you?
The Elephant in the Cloud: A Quest for the Next Generation
In this talk, I will go through the evolution of Hadoop and its ecosystem projects and will try to peer into the crystal ball to predict what may be coming down the pike. I will discuss various ways of crunching data on Hadoop (MapReduce, OpenMPI, Spark and various SQL engines) and how these tools complement each other.
Apache Hadoop is no longer just a faithful, open source, scalable implementation of two seminal papers that came out of Google 10 years ago. It has evolved into a project that provides enterprises with a reliable layer for storing massive amounts of unstructured data (HDFS) while allowing different computational frameworks to leverage those datasets.
The original computational framework (MapReduce) has evolved into a much more scalable set of general-purpose cluster management APIs collectively known as YARN. With YARN underneath, MapReduce is still there to support batch-oriented computations, but it is no longer the only game in town. With OpenMPI, Spark, and Tez rapidly becoming available, now is truly the most exciting time to be a developer in the Hadoop ecosystem. It is also a time when you don't have to be employed by Yahoo!, Facebook or eBay to have access to mind-blowing compute power. That power is a credit card and a pivotal.io account away from anybody on the planet.
I will conclude by outlining some of the ongoing work that makes Hadoop and its ecosystem projects first class citizens in cloud environments based on the work that Pivotal engineers have done with integrating Hadoop into PivotalONE PaaS.
- Hadoop was created to allow processing of large datasets in a distributed, fault-tolerant manner. It was originally developed by Doug Cutting and Mike Cafarella as part of the Nutch project, inspired by Google's papers on GFS and MapReduce and by the growing data and computational needs of web-scale companies.
- The core of Hadoop consists of the Hadoop Distributed File System (HDFS) for storage and Hadoop MapReduce for distributed processing, plus utilities like Hadoop Common for file system access and other basic functionality (a minimal word-count sketch of the MapReduce model follows after this list).
- Hadoop's goals were to process multi-petabyte datasets across commodity hardware in a reliable, flexible and open source way. It assumes failures are expected and handles them to provide fault tolerance.
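To make the MapReduce model above concrete, here is a generic, hedged word-count sketch in the Hadoop Streaming style. It is illustrative only, not taken from the summarized document, and the input/output paths in the comment are assumptions.

```python
#!/usr/bin/env python
# Assumed invocation: hadoop jar hadoop-streaming.jar \
#   -input /data/in -output /data/out \
#   -mapper "python wc_stream.py map" -reducer "python wc_stream.py reduce" \
#   -file wc_stream.py
import sys


def mapper():
    # Emit one (word, 1) pair per token read from stdin.
    for line in sys.stdin:
        for word in line.split():
            print("%s\t1" % word.lower())


def reducer():
    # Input arrives sorted by key, so counts for a word are contiguous.
    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, count))
            current, count = word, 0
        count += int(n)
    if current is not None:
        print("%s\t%d" % (current, count))


if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```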
Danielle Jabin is a data engineer at Spotify who works on A/B testing infrastructure. She describes Spotify's big data landscape, which includes over 40 million active users generating 1.5 TB of compressed data per day. Spotify collects this user data using Kafka for high-volume data collection, processes it using Hadoop on a large cluster, and stores aggregates in databases like PostgreSQL and Cassandra for analytics and visualization.
Slides from a talk at a meetup organized by SF Scala at Spotify's San Francisco office. The slides present details of playlist recommendations at Spotify and how Spotify uses Scalding to develop robust and reliable pipelines to generate these recommendations.
Meetup details: http://www.meetup.com/SF-Scala/events/224430674/
Hadoop Summit 2016 presentation.
As Yahoo continues to grow and diversify its mobile products, its platform team faces an ever-expanding, sophisticated group of analysts and product managers, who demand faster answers: to measure key engagement metrics like Daily Active Users, app installs and user retention; to run longitudinal analysis; and to re-calculate metrics over rolling time windows. All of this, quickly enough not to need a coffee break while waiting for results. The optimal solution for this use-case would have to take into account raw performance, cost, security implications and ease of data management. Could the benefits of Hive, ORC and Tez, coupled with a good data design provide the performance our customers crave? Or would it make sense to use more nascent, off-grid querying systems? This talk will examine the efficacy of using Hive for large-scale mobile analytics. We will quantify Hive performance on a traditional, shared, multi-tenant Hadoop cluster, and compare it with more specialized analytics tools on a single-tenant cluster. We will also highlight which tuning parameters yield maximum benefits, and analyze the surprisingly ineffectual ones. Finally, we will detail several enhancements made by Yahoo's Hive team (in split calculation, stripe elimination and the metadata system) to successfully boost performance.
This document provides an overview of debugging Hive queries with Hadoop in the cloud. It discusses Altiscale's Hadoop as a Service platform and perspective as an operational service provider. It then covers Hadoop 2 architecture, debugging tools, accessing logs in Hadoop 2, the Hive and Hadoop architecture, Hive logs, common Hive issues and case studies on stuck jobs and missing directories. The document aims to help users better understand and troubleshoot Hive queries running on Hadoop clusters.
This is part of an introductory course to Big Data Tools for Artificial Intelligence. These slides introduce students to the use of Apache Pig as an ETL tool over Hadoop.
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of - Charles Givre
Study after study shows that data preparation and other data janitorial work consume 50-90% of most data scientists’ time. Apache Drill is a very promising tool which can help address this. Drill works with many different forms of “self describing data” and allows analysts to run ad-hoc queries in ANSI SQL against that data. Unlike Hive or other SQL-on-Hadoop tools, Drill is not a wrapper for MapReduce and can scale to clusters of up to 10k nodes.
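As a hedged illustration of the ad-hoc ANSI SQL queries described above, the sketch below posts a query to Drill's REST endpoint using the requests library. The host, port, response shape and Drill's bundled employee.json classpath sample are assumptions to verify against your own installation.

```python
import requests

resp = requests.post(
    "http://localhost:8047/query.json",          # assumed Drill web/REST port
    json={"queryType": "SQL",
          "query": "SELECT full_name, salary FROM cp.`employee.json` LIMIT 5"},
)
resp.raise_for_status()

# Drill's REST API returns the result rows as a list of column->value mappings.
for row in resp.json()["rows"]:
    print(row)
```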
The document provides information about Hive and Pig, two frameworks for analyzing large datasets using Hadoop. It compares Hive and Pig, noting that Hive uses a SQL-like language called HiveQL to manipulate data, while Pig uses Pig Latin scripts and operates on data flows. The document also includes code examples demonstrating how to use basic operations in Hive and Pig like loading data, performing word counts, joins, and outer joins on sample datasets.
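For a concrete flavor of the HiveQL side of that comparison, here is a hedged sketch of a word count submitted from Python via PyHive; the connection details and the docs(line STRING) table are assumptions for illustration.

```python
from pyhive import hive

conn = hive.connect(host="localhost", port=10000, username="hive")  # assumed HiveServer2
cursor = conn.cursor()

# Word count over a hypothetical docs(line STRING) table:
# explode each line into words, then group and count.
cursor.execute("""
    SELECT word, COUNT(*) AS cnt
    FROM docs
    LATERAL VIEW explode(split(line, ' ')) t AS word
    GROUP BY word
    ORDER BY cnt DESC
    LIMIT 20
""")

for word, cnt in cursor.fetchall():
    print(word, cnt)
```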
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign... - Yahoo Developer Network
Spark and Ignite are two of the most popular open source projects in the area of high-performance Big Data and Fast Data. But did you know that one of the best ways to boost performance for your next generation real-time applications is to use them together? In this session, Dmitriy Setrakyan, Apache Ignite Project Management Committee Chairman and co-founder and CPO at GridGain will explain in detail how IgniteRDD — an implementation of native Spark RDD and DataFrame APIs — shares the state of the RDD across other Spark jobs, applications and workers. Dmitriy will also demonstrate how IgniteRDD, with its advanced in-memory indexing capabilities, allows execution of SQL queries many times faster than native Spark RDDs or Data Frames. Don't miss this opportunity to learn from one of the experts how to use Spark and Ignite better together in your projects.
Speakers:
Dmitriy Setrakyan is a founder and CPO at GridGain Systems. Dmitriy has been working with distributed architectures for over 15 years and has expertise in the development of various middleware platforms, financial trading systems, CRM applications and similar systems. Prior to GridGain, Dmitriy worked at eBay, where he was responsible for the architecture of an ad-serving system processing several billion hits a day. Currently, Dmitriy also acts as PMC chair of the Apache Ignite project.
From Oracle to Hadoop with Sqoop and other tools - Guy Harrison
This document discusses tools for transferring data between relational databases and Hadoop, focusing on Apache Sqoop. It describes how Sqoop was optimized for Oracle imports and exports, reducing database load by up to 99% and improving performance by 5-20x. It also outlines the goals of Sqoop 2 to improve usability, security, and extensibility through a REST API and by separating responsibilities.
This document discusses Hive Editor in Hue, an open source web interface for Hadoop that includes applications for Hive, Pig, Impala, Oozie, Solr, Sqoop, and HBase. The Hive Editor in Hue provides syntax highlighting, query autocomplete, live progress and logs for Hive queries and MapReduce jobs, the ability to work with multiple databases and statements, and features for saving, exporting, and sharing queries. It connects to HiveServer2 and supports Sentry for authorization. A demo of the Hive Editor and Metastore Browser is provided.
NYC HUG - Application Architectures with Apache Hadoop - markgrover
This document summarizes Mark Grover's presentation on application architectures with Apache Hadoop. It discusses processing clickstream data from web logs using techniques like deduplication, filtering, and sessionization in Hadoop. Specifically, it describes how to implement sessionization in MapReduce by using the user's IP address and timestamp to group log lines into sessions in the reducer.
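A hedged sketch of the sessionization idea described above: given log lines already grouped by IP (the map key) and sorted by timestamp, the reducer starts a new session whenever the gap between hits exceeds a timeout. The 30-minute timeout and the input format are assumptions, not the presenter's exact code.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)


def sessionize(events):
    """events: iterable of (ip, timestamp) for ONE ip, sorted by timestamp."""
    sessions, current = [], []
    last_ts = None
    for ip, ts in events:
        # A long silence ends the current session and starts a new one.
        if last_ts is not None and ts - last_ts > SESSION_TIMEOUT:
            sessions.append(current)
            current = []
        current.append((ip, ts))
        last_ts = ts
    if current:
        sessions.append(current)
    return sessions


if __name__ == "__main__":
    t = lambda s: datetime.strptime(s, "%Y-%m-%d %H:%M:%S")
    hits = [("10.0.0.1", t("2016-01-01 10:00:00")),
            ("10.0.0.1", t("2016-01-01 10:05:00")),
            ("10.0.0.1", t("2016-01-01 11:30:00"))]
    print([len(s) for s in sessionize(hits)])  # -> [2, 1]
```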
IPython Notebook as a Unified Data Science Interface for Hadoop - DataWorks Summit
This document discusses using IPython Notebook as a unified data science interface for Hadoop. It proposes that a unified environment needs: 1) mixed local and distributed processing via Apache Spark, 2) access to languages like Python via PySpark, 3) seamless SQL integration via SparkSQL, and 4) visualization and reporting via IPython Notebook. The document demonstrates this environment by exploring open payments data between doctors/hospitals and manufacturers.
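A minimal sketch of the "unified environment" pieces listed above, as you might run them from a notebook cell: PySpark for distributed processing and SparkSQL for SQL over the same data. It is written against the Spark 2.x SparkSession API, and the payments file path and column names are assumptions, not the actual open payments schema.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("open-payments-sketch").getOrCreate()

# 1) Distributed processing via PySpark DataFrames.
payments = spark.read.csv("open_payments.csv", header=True, inferSchema=True)

# 2) Seamless SQL via SparkSQL over the same data.
payments.createOrReplaceTempView("payments")
top = spark.sql("""
    SELECT recipient_state, SUM(total_amount) AS total
    FROM payments
    GROUP BY recipient_state
    ORDER BY total DESC
    LIMIT 10
""")

# 3) Bring the small result set back locally for plotting in the notebook.
print(top.toPandas())
```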
This document provides guidance on sizing and configuring Apache Hadoop clusters. It recommends separating master nodes, which run processes like the NameNode and JobTracker, from slave nodes, which run DataNodes, TaskTrackers and RegionServers. For medium to large clusters it suggests 4 master nodes and the remaining nodes as slaves. The document outlines factors to consider for optimizing performance and cost like selecting balanced CPU, memory and disk configurations and using a "shared nothing" architecture with 1GbE or 10GbE networking. Redundancy is more important for master than slave nodes.
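The sizing advice above boils down to simple arithmetic. The sketch below shows a back-of-the-envelope worker-node count; every number in it (disk per node, replication factor, overhead, usable fraction, data volume) is an assumption to replace with your own.

```python
import math

raw_data_tb = 500          # data you need to store
replication = 3            # HDFS replication factor
overhead = 0.25            # scratch space for shuffle/intermediate output
disk_per_node_tb = 12 * 3  # e.g. 12 drives x 3 TB, JBOD "shared nothing"
usable_fraction = 0.70     # leave room for OS, logs, and non-HDFS use

needed_tb = raw_data_tb * replication * (1 + overhead)
slave_nodes = math.ceil(needed_tb / (disk_per_node_tb * usable_fraction))
print("worker (slave) nodes needed:", slave_nodes)  # masters are sized separately
```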
A glimpse of test automation in the Hadoop ecosystem by Deepika Achary - QA or the Highway
This document discusses test automation in the Hadoop ecosystem. It provides an overview of key components like HDFS, HBase, Kafka, and Solr. It then describes how to set up test automation for each component using Java libraries and classes. Automating tests provides advantages like creating a test framework, enabling gray box testing, running tests easily in batch mode, ensuring flexibility of test data, quickly finding bugs, and maintaining health of systems. The presentation concludes with key learnings around Big Data, Hadoop components, and how to approach automation.
The document provides an overview of Hadoop, an open-source software framework for distributed storage and processing of large datasets. It describes how Hadoop uses HDFS for distributed file storage across clusters and MapReduce for parallel processing of data. Key components of Hadoop include HDFS for storage, YARN for resource management, and MapReduce for distributed computing. The document also discusses some popular Hadoop distributions and real-world uses of Hadoop by companies.
Data Engineering with Spring, Hadoop and Hive - Alex Silva
This presentation will outline the evolution of the monitoring data platform pipeline at Rackspace and explore the compute and data management challenges we have faced at this scale. We will focus on our use of Hadoop and Hive as data storage and transformation platforms while discussing the technology stack, key architectural decisions, observations and pitfalls encountered in building the pipeline.
Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. It presents a SQL-like interface for querying data stored in various databases and file systems that integrate with Hadoop. The document provides links to Hive documentation, tutorials, presentations and other resources for learning about and using Hive. It also includes a table describing common Hive CLI commands and their usage.
This document provides an overview of the Sqoop tool, which is used to transfer data between Hadoop and relational database servers. Sqoop can import data from databases into HDFS and export data from HDFS to databases. The document describes how Sqoop works, provides installation instructions, and outlines various Sqoop commands for import, export, jobs, code generation, and interacting with databases.
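To ground the import workflow described above, here is a hedged sketch that shells out to the sqoop CLI from Python; the JDBC URL, credentials file, table and target directory are placeholders, not values from the document.

```python
import subprocess

# Import the `orders` table from MySQL into HDFS with four parallel mappers.
subprocess.run(
    [
        "sqoop", "import",
        "--connect", "jdbc:mysql://db.example.com/shop",
        "--username", "etl",
        "--password-file", "/user/etl/.db_password",  # keeps the password off the command line
        "--table", "orders",
        "--target-dir", "/data/raw/orders",
        "--num-mappers", "4",
    ],
    check=True,  # raise if sqoop exits non-zero
)
```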
2016 Data Science Salary Survey - O’Reilly Data Science - Adam Rabinovitch
In this fourth edition of the O’Reilly Data Science Salary Survey, O’Reilly analyzed input from 983 respondents working in the data space, across a variety of industries, representing 45 countries and 45 US states. Through the results of the 64-question survey, they explored which tools data scientists, analysts, and engineers use, which tasks they engage in, and, of course, how much they make.
Key findings include:
• Python and Spark are among the tools that contribute most to salary.
• Among those who code, the highest earners are the ones who code the most.
• SQL, Excel, R and Python are the most commonly used tools.
• Those who attend more meetings earn more.
• Women make less than men for doing the same thing.
• Country and US state GDP serves as a decent proxy for geographic salary variation (not as a direct estimate, but as an additional input for a model).
• The most salient division in tool and task usage is between those who mostly use Excel, SQL, and a small number of closed-source tools, and those who use more open-source tools and spend more time coding.
• R is used across this division: even people who don’t code much or use many open-source tools use R.
• A secondary division emerges among the coding half, separating a younger, Python-heavy data scientist/analyst group from a more experienced data scientist/engineer cohort that tends to use a high number of tools and earns the highest salaries.
Bridging the Gap Between Data Science & Engineer: Building High-Performance T... - ryanorban
Data scientists, data engineers, and data businesspeople are critical to leveraging data in any organization. A common complaint from data science managers is that data scientists invest time prototyping algorithms, and throw them over a proverbial fence to engineers to implement, only to find the algorithms must be rebuilt from scratch to scale. This is a symptom of a broader ailment -- that data teams are often designed as functional silos without proper communication and planning.
This talk outlines a framework to build and organize a data team that produces better results, minimizes wasted effort among team members, and ships great data products.
10 more lessons learned from building Machine Learning systems - Xavier Amatriain
1. Machine learning applications at Quora include answer ranking, feed ranking, topic recommendations, user recommendations, and more. A variety of models are used including logistic regression, gradient boosted decision trees, neural networks, and matrix factorization.
2. Implicit signals like watching and clicking tend to be more useful than explicit signals like ratings. However, both implicit and explicit signals combined can better represent long-term goals.
3. The outputs of machine learning models will often become inputs to other models, so models need to be designed with this in mind to avoid issues like feedback loops.
From Digital Analytics to Insights: Data-Driven Decision Making & Changes in Consumer Trends to Effectively Develop Below-the-Line Campaigns / Guest speaking on Nov 25, 2015 at Asia Business Connect's conference on "Effective Below-the-Line Marketing Strategies"
This talk given at the Hadoop Summit in San Jose on June 28, 2016, analyzes a few major trends in Big Data analytics.
These are a few takeaways from this talk:
- Adopt Apache Beam for easier development and portability between Big Data Execution Engines.
- Adopt stream analytics for faster time to insight, competitive advantages and operational efficiency.
- Accelerate your Big Data applications with In-Memory open source tools.
- Adopt Rapid Application Development of Big Data applications: APIs, Notebooks, GUIs, Microservices…
- Have Machine Learning part of your strategy or passively watch your industry completely transformed!
- How to advance your strategy for hybrid integration between cloud and on-premise deployments?
Data Visualization 101: How to Design Charts and Graphs - Visage
Learn to design effective charts and graphs.
Your data is only as good as your ability to understand and communicate it. The right visualization is essential to incite a desired action, whether from customers or colleagues. But most marketers aren’t mathematicians or adept at data visualization. Fortunately, you don’t need a PhD in statistics to crack the data visualization code.
QCon Rio - Machine Learning for Everyone - Dhiana Deva
Supercomputers and teams of MIT PhDs are no longer required to build data-driven predictive models. We are witnessing innovations in machine learning that are making this field more and more accessible.
This talk aims to demystify machine learning by presenting its concepts and a range of technologies in use.
It covers the types of problems in this area (classification, regression, clustering, dimensionality reduction, etc.), their stages (normalization, training, optimization, regularization, etc.) and their algorithms, from linear regression and k-means through decision trees to neural networks, always applied to real problems.
The talk also introduces tools such as Scikit-learn, Pandas, R, MATLAB and Amazon Machine Learning, along with a way to practice and experiment with these ideas through competitions like Kaggle.
This document provides instructions on how to draw simple electrical circuit diagrams and install a basic electrical circuit. It explains the basic electrical symbols and the types of diagrams. It then describes the steps for receiving the necessary materials and installing a circuit, including checking its operation and installing meters.
The document describes the steps to create a blog, including selecting a template, adding 5 posts, and publishing the first comment while waiting for more comments from readers.
Why You Should Team Up and Make Friends: Your Professional Responsibilities W... - Parsons Behle & Latimer
A presentation about the ethical and professional obligations when reviewing a potential personal injury matter and when associating with another firm on personal injury matters.
The presentation title and the presenter's contact information are provided. No other details about the content or purpose of the presentation are given in the brief document. It appears to be an introductory slide with only basic identifying information listed.
Final teaching listening june 2012 with recording - Just_Peachy44
The document discusses selective listening and provides tips for practicing it. Selective listening means focusing only on the specific information you need rather than trying to understand everything that is said. It can be used when learning costs, business hours, or processes. The tips include planning what information you need beforehand, listening multiple times if needed, taking notes, and asking the speaker to repeat important points. An example is provided of selectively listening to a movie theater recording to find the price of matinee tickets.
The document proposes a simpler and smarter user interface for phone applications called "Phone Pad" that uses drag and drop gestures instead of multiple buttons. Phone Pad would allow users to change their status by dragging a phone icon to different areas of the interface. Users could also answer calls, make calls, conference calls, transfer calls, and log out by interacting with the phone icon. The interface is intended to make phone applications easier to use compared to traditional button-based interfaces. The proposal is being prepared for patent application.
This document summarizes an analysis of global oil production capacity to the year 2020. Some key points:
1) Additional unrestricted global oil production of over 49 million barrels per day is targeted by 2020, but after adjusting for risks, the potential increase is estimated to be around 29 million barrels per day.
2) Factoring in depletion rates and reserve growth, the estimated net increase in global oil production capacity by 2020 is around 17.6 million barrels per day, bringing total capacity to around 110.6 million barrels per day.
3) The largest estimated increases in production capacity by 2020 come from Iraq, the United States, Canada, and Brazil. The U.S. increase is particularly significant due
Move out from AppEngine, and Python PaaS alternatives - tzang ms
This document discusses moving a podcast hosting application called MyAudioCast off of Google App Engine (GAE) and onto other Python platforms as a result of high costs and limitations. Some key points:
- MyAudioCast was running on GAE for over a year but costs were rising to $120/month due to high storage, bandwidth, and processing usage.
- Performance on GAE was poor with high error rates for operations like inserting logs and updating counters.
- Development was slowed by GAE limitations like long deployment times and inability to easily use common Python packages.
- The author chose to migrate MyAudioCast to the Linode VPS and Heroku PaaS for better pricing,
Spotify in the Cloud - An evolution of data infrastructure - Strata NYC - Josh Baer
Slides from a presentation given by Alison Gilles and Josh Baer during StrataNYC 2017.
Covers the decision, challenge and strategy (technical, organizational, people) for converting Spotify's 2500 node Hadoop cluster's worth of data and processing to Google Cloud.
Finally, touches on Spotify's resulting infrastructure on GCP.
This document discusses using ClickHouse for experimentation and metrics at Spotify. It describes how Spotify built an experimentation platform using ClickHouse to provide teams interactive queries on granular metrics data with low latency. Key aspects include ingesting data from Google Cloud Storage to ClickHouse daily, defining metrics through a centralized catalog, and visualizing metrics and running queries using Superset connected to ClickHouse. The platform aims to reduce load on notebooks and BigQuery by serving common queries directly from ClickHouse.
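As a hedged illustration of the kind of interactive metric query such a platform serves, the sketch below uses the clickhouse-driver Python package; the host and the experiment_metrics table and columns are assumptions, not Spotify's actual schema.

```python
from clickhouse_driver import Client

client = Client(host="clickhouse.internal", port=9000)  # assumed native-protocol endpoint

# Daily event counts per experiment over the last week.
rows = client.execute(
    """
    SELECT experiment_id, toDate(event_time) AS day, count() AS events
    FROM experiment_metrics
    WHERE event_time >= now() - INTERVAL 7 DAY
    GROUP BY experiment_id, day
    ORDER BY experiment_id, day
    """
)
for experiment_id, day, events in rows:
    print(experiment_id, day, events)
```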
This document summarizes and compares several open source monitoring tools: Nagios, Graphite, StatsD, Logstash, and Sensu. Nagios is introduced as a commonly used tool that some love and some find frustrating. Graphite is described as a tool for storing and graphing time-series data. StatsD aggregates counters and timers and sends them to backend services like Graphite. Logstash is an tool for managing logs and events that can input, filter, and output data. Sensu is a monitoring router that connects check scripts to handler scripts to alert or process monitoring data. Examples are given for each tool and what types of metrics to collect.
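To make the StatsD role above concrete, here is a minimal sketch of its plain-text UDP wire protocol for counters and timers; the daemon address and metric names are assumptions.

```python
import socket
import time

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
STATSD = ("statsd.example.com", 8125)  # StatsD's conventional UDP port

# Counter: "<metric>:<value>|c"
sock.sendto(b"app.logins:1|c", STATSD)

# Timer: "<metric>:<milliseconds>|ms"
start = time.time()
# ... do the work being measured ...
elapsed_ms = int((time.time() - start) * 1000)
sock.sendto(("app.login_latency:%d|ms" % elapsed_ms).encode(), STATSD)
```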
Autodesk has built a large self-service big data pipeline to process large amounts of data from their various products and services on a daily basis. The pipeline ingests raw data, indexes it, aggregates and summarizes it over time, and makes it available to business users through various reporting and analytics tools. It processes over 2 billion transactions per day from many different data sources totaling over 800 terabytes of data.
Apache Tajo: A Big Data Warehouse System on Hadoop
Presented by Jae-hwa Jeong, Apache Tajo committer and senior research engineer at Gruter, at Bigdata World Convention 2014 on Oct. 23 in Busan, Korea
Data-Driven Development Era and Its Technologies - SATOSHI TAGOMORI
This document discusses data-driven development and the technologies used in the data analytics process. It covers topics like data collection, storage, processing, and visualization. The document advocates using managed cloud services for data and analytics to focus on data instead of managing infrastructure. Choosing technologies should be based on the type of data and problems to solve, not the other way around. Services like Google BigQuery, Amazon Redshift, and Treasure Data are recommended for their ease of use.
How LinkedIn Democratizes Big Data Visualization - Chi-Yi Kuan
Speakers: Jonathan Wu (LinkedIn), Praveen Neppalli Naga (LinkedIn), Chi-Yi Kuan (LinkedIn)
Category: Hadoop in Action
LinkedIn processes enormous numbers of events each day. This data is of critical importance for data analysts, engineers, business experts, and data scientists who seek a deep understanding of the interactions within LinkedIn’s professional social graph. They use this data to derive insights and performance metrics, which lead to better business decisions on products, marketing, sales, and other functional areas. Areas of interest include Email, Growth, Engagement, and Trending metrics. Development of internal tools has traditionally been based on specific needs, optimized for the business use case, and non-interoperable. The engineering challenge is to allow business users to easily access and organize huge amounts of data in a comprehensive way, and to get to the insights they need quickly and flexibly through graphs and charts. The data needs to be sufficiently granular to work for different needs, the interface needs to be intuitive and simple, and the infrastructure needs to be high-performance, allowing users to manipulate large amounts of data quickly.
The solution to this challenge was realized by the LinkedIn Business Analytics and Data Analytics Infrastructure teams utilizing an integrated stack that includes an interactive analytics infrastructure and a self-serve data visualization front-end solution. The user interface provides a customizable ability to build charts, tables, and queries to suit highly customized reporting needs on any device. The back-end infrastructure is based on Hadoop, which leverages LinkedIn’s investment in highly scalable, data-rich systems. The combined solution brings the ability to visualize, slice, dice, and drill through billions of records and hundreds of dimensions quickly and at scale.
In this talk, you will learn the background of the data challenges that LinkedIn faced, how the teams came together to construct the solution, and the underlying stack structure powering this solution.
The document discusses strategies for scaling real-time applications to support 1 million concurrent users on the JVM. It recommends using microservices and embracing polyglot programming. It also provides examples of building blocks for distributed systems including consistent hashing, bloom filters, throttling with leaky bucket algorithms, and using Kafka for asynchronous data processing pipelines.
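Of the building blocks listed above, consistent hashing is easy to show compactly. The sketch below is a generic ring with virtual nodes, not the talk's implementation: keys and nodes are hashed onto a ring, each key goes to the first node clockwise from it, and adding or removing a node only remaps a small share of keys.

```python
import bisect
import hashlib


def _hash(value):
    return int(hashlib.md5(value.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes, vnodes=100):
        # Virtual nodes smooth out the key distribution across physical nodes.
        self._ring = sorted(
            (_hash("%s#%d" % (node, i)), node)
            for node in nodes for i in range(vnodes)
        )
        self._points = [p for p, _ in self._ring]

    def node_for(self, key):
        # First ring point clockwise from the key's hash (wrapping around).
        idx = bisect.bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["cache-1", "cache-2", "cache-3"])
print(ring.node_for("user:42"))  # the same key always maps to the same node
```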
1) SAP provides database, analytics, mobile, and cloud software and services including SAP HANA, Sybase IQ, SQL Anywhere, and Sybase ASE databases.
2) SAP works with customers in various industries like automotive, sports, and telecommunications to develop real-time analytics solutions using SAP HANA.
3) SAP continues to invest in research to advance its software and database technologies.
Spark Magic: Building and Deploying a High Scale Product in 4 Months - tsliwowicz
This document summarizes Taboola's use of Spark to build their Newsroom product, a real-time analytics tool for content sites, in 4 months. Key points include: Taboola deployed Newsroom on a large Spark and Cassandra cluster to process 5TB of daily data and provide real-time recommendations, testing, and analytics. Newsroom aggregates data into batches and replays processing to ensure accurate counts. The system faced challenges around performance optimizations, debugging, and issues like keys being dependent on JVM state. Spark helped Taboola successfully deliver Newsroom and supports other uses like automatic campaign management.
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase - DataWorks Summit
As one of the few closed-loop payment platforms, PayPal is uniquely positioned to provide merchants with insights aimed to identify opportunities to help grow and manage their business. PayPal processes billions of data events every day around our users, risk, payments, web behavior and identity. We are motivated to use this data to enable solutions to help our merchants maximize the number of successful transactions (checkout-conversion), better understand who their customers are and find additional opportunities to grow and attract new customers.
As part of the Merchant Data Analytics, we have built a platform that serves low latency, scalable analytics and insights by leveraging some of the established and emerging platforms to best realize returns on the many business objectives at PayPal.
Join us to learn more about how we leveraged platforms and technologies like Spark, Hive, Druid, Elastic Search and HBase to process large scale data for enabling impactful merchant solutions. We’ll share the architecture of our data pipelines, some real dashboards and the challenges involved.
Speakers
Kasiviswanathan Natarajan, Member of Technical Staff, PayPal
Deepika Khera, Senior Manager - Merchant Data Analytics, PayPal
- Data observability is important for Spotify because they process massive amounts of data from 8 million events per second.
- To ensure observability, Spotify annotates and documents their data schemas, monitors pipeline execution times and counts to check for errors, monitors financial costs of pipelines and storage, and sets up alerts and dashboards to monitor for failures (a minimal sketch of such a count check follows after this list).
- Having good data observability helps Spotify understand where their data is coming from and going, troubleshoot issues quickly, and ensure royalty payments to artists are accurate since they rely on the data pipelines.
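As a hedged illustration of the count monitoring mentioned above, here is a minimal day-over-day record-count check; the threshold, the count source and the alerting hook are assumptions, not Spotify's actual system.

```python
def check_daily_counts(today_count, yesterday_count, max_drop=0.2):
    """Return (ok, message); flag an alert if today's volume fell more than max_drop."""
    if yesterday_count == 0:
        return False, "no baseline: yesterday had zero records"
    drop = (yesterday_count - today_count) / yesterday_count
    if drop > max_drop:
        return False, "record count dropped %.0f%% vs yesterday" % (drop * 100)
    return True, "ok"


ok, msg = check_daily_counts(today_count=6_100_000, yesterday_count=8_000_000)
if not ok:
    print("ALERT:", msg)  # in practice this would page someone or post to a dashboard
```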
The document discusses AlpineNow, a company that provides advanced analytics solutions for big data. It describes how AlpineNow allows for code-free, visual and collaborative analytics that reduce the time to insights from weeks/months to hours/days. Key features highlighted include automated data collection, self-serve visual exploration and analysis of entire datasets, and multi-user collaboration on models and projects.
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2l2Rr6L.
Doug Daniels discusses the cloud-based platform they have built at DataDog and how it differs from a traditional datacenter-based analytics stack. He walks through the decisions they have made at each layer, covers the pros and cons of these decisions and discusses the tooling they have built. Filmed at qconsf.com.
Doug Daniels is a Director of Engineering at Datadog, where he works on high-scale data systems for monitoring, data science, and analytics. Prior to joining Datadog, he was CTO at Mortar Data and an architect and developer at Wireless Generation, where he designed data systems to serve more than 4 million students in 49 states.
This talk evaluates some easy ways to extract useful trending and capacity planning out of your existing monitoring investment. Using Nagios performance data, we examine simple behaviors with PNP4Nagios and graduate on to more insightful analytics with Graphite. With metrics in hand, we look at the questions that IT /should/ be asking, such as:
* What sort of data should I trend?
* Why do I need to trend it?
* How do Operational or Engineering trends relate to Business or Transactional monitoring?
* How does this data impact our customer relationship and/or their bottom-line?
Finally, we look at creative ways to get profiling data out of your production systems with a minimum amount of effort from your development team.
TIBCO provides an analytics platform that delivers business value across the analytics spectrum from descriptive to predictive to prescriptive analytics. The platform includes Spotfire for visual analytics, predictive analytics using R scripting, and real-time event processing capabilities. It can consume and analyze various data sources including big data. The platform enables different types of users from data scientists to analysts to business users.
How Tencent Applies Apache Pulsar to Apache InLong - Pulsar Summit Asia 2021 - StreamNative
1) Apache InLong is an open source data integration framework that provides automatic, secure, and reliable data transmission. It supports both batch and stream processing using different message queues like Apache Pulsar.
2) Apache Pulsar is used with Apache InLong because it offers very low latency, high throughput, reliable data transmission, and multi-tenancy. KoP allows migrating Kafka workloads to Pulsar.
3) Apache InLong contributes to Apache Pulsar through over 60 contributors and 50 pull requests to the KoP project. It uses Pulsar for auto disaster tolerance, multi-tenancy of data streams, and auditing data streams.
The Ipsos AI Monitor 2024 Report - Social Samosa
According to Ipsos AI Monitor's 2024 report, 65% of Indians said that products and services using AI have profoundly changed their daily lives in the past 3-5 years.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad and Procure.FYI's Co-Found
Learn SQL from basic queries to advanced queries - manishkhaire30
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Aggregage
This webinar will explore cutting-edge, less familiar but powerful experimentation methodologies which address well-known limitations of standard A/B Testing. Designed for data and product leaders, this session aims to inspire the embrace of innovative approaches and provide insights into the frontiers of experimentation!
Global Situational Awareness of A.I. and where it's headedvikram sood
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be unleashed, and before long, The Project will be on. If we’re lucky, we’ll be in an all-out race with the CCP; if we’re unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
The Building Blocks of QuestDB, a Time Series Databasejavier ramirez
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review a history of some of the changes we have gone through over the past two years to deal with late and unordered data, non-blocking writes, read replicas, and faster batch ingestion.
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...sameer shah
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
3. What is Spotify?
For everyone:
• Streaming Service
• Launched in October 2008
• 60 Million Monthly Users
• 15 Million Paid Subscribers
+ and for me:
• 1.3K nodes Hadoop cluster
Hi – my name is Rafal and I’m an engineer at Spotify. In this presentation I will talk about how to be a productive data engineer. I will combine the knowledge of multiple productive engineers at Spotify and touch on different areas of your daily work life. I will use real-world examples, failures and success stories – but mostly failures. So if you are, or want to be, a data engineer, hopefully after this presentation every single one of you will take at least one ‘ahh’ moment with you – the moment when you learn something new. I hope this learning will improve your productivity, bring a new feature to your infrastructure, or maybe spark a discussion inside your team.
We will go through the lessons of a productive data engineer and cover four different areas – operations, development, organization and culture. We will work our way up from low-level admin tips and spectacular disasters; after hardcore operations we will talk about development on Hadoop – what to avoid, and how Spotify is overcoming the huge problem of legacy Hadoop tools. After the development part we will take a look at how organizational structure can affect your productivity and how one can tackle that problem, and finish with the ubiquitous culture – how culture can help you be productive. So in a way we start with the low-level scope of your productivity – how the cluster itself can affect you and how operators can help you out – to later talk about your development decisions, the structure of the company, and finally how your environment can influence your work. There will be time for questions at the end, so please keep them until then.
Before we go deep into the presentation, let’s first talk about what Spotify is. Spotify is a streaming service, launched in 2008 in beautiful Stockholm, Sweden. The current public numbers are 60M monthly users and 15M subscribers. What’s unique about the Spotify service is that it can play a perfect song for every single moment, and some of this is powered through Hadoop – which makes it even cooler!
For me, Spotify is also a 1.3K-node Hadoop cluster – which is like a baby for a team of 4 people. A baby that is sometimes very frustrating; shit happens all the time and you have to wake up in the middle of the night and clean it up, but it’s our baby and we love it. Without further ado, let’s move to the core of the topic and start with operations. If there’s one lesson that comes from operating Hadoop clusters from a handful of nodes in the corner of the office to 1300 nodes – it is AUTOMATION.
Automation is crucial – especially when talking about Hadoop. Hadoop is a huge beast to manage: there are loads of moving parts, loads of new stuff coming in, and there’s always some reason for a Hadoop cluster to go down. As if Hadoop were not enough, there’s always something extra that your company will push on the poor operators – whether it’s a new Linux distro, or a bug in libc that means you need to restart all the daemons, and so on and so on.
You want to be proactive and do as little manual work as possible – without automation even coffee won’t help you. You want to be like Adam – happy, working on new features, enhancing Hadoop and bringing joy to Hadoop users. You don’t want to be the poor operator on the left, focused primarily on putting out fires, exhausted. By the way, this is a picture of Adam and me after 40 hours of Hadoop upgrade from Hadoop 1 to 2 in 2013.
So how do you reach good-enough automation of your cluster? Let’s take Spotify as an example. Spotify started with Hadoop in 2009, very early; then there were a couple of tiny expansions and a short episode of Hadoop on EMR, and we went back to on-premise with a shiny new 60 nodes. At that point we had to decide how to manage Hadoop – and because back then Cloudera Manager (CM) was limited, Ambari didn’t exist, and Spotify loves Puppet, we decided to use Puppet for this use case. It was a rather big effort and took time, during which we had to drop some work, put out fires and work on Puppet, but it was a great investment. Today, after a few iterations, we like our Puppet. As an example, the most recent ongoing expansion is rather easy – we name the machines using the proper naming convention and Puppet kicks in, installs all the services and configuration, and keeps the machines in a normalized state – a very, very important piece of our infrastructure. But wait – the slide says something about Ambari and CM. Yes – because if we were to set up a cluster today, we would most likely evaluate at least these two solutions. Like I said, Spotify basically didn’t have a choice, we settled on Puppet, and we are happy about it right now, but there’s huge leverage you can gain from using these tools – loads of features that you get out of the box that we had to implement ourselves. So if you are considering building a Hadoop cluster, make sure to give these tools a good try. They may not solve all your issues and use cases, but they will surely bring loads of value, and over time you will get even more features just from the community – which is great, and is something that we are missing.
That said – even if you decide to use Ambari or CM, most likely you will still need some kind of configuration management tool, whether it’s Puppet, Chef, Salt or whatever your favorite is. You will need one: there will always be some extra library to install and configure, some user to create, LDAP configuration and so on. There’s another interesting outcome of building our own Puppet infrastructure – we know exactly how our Hadoop is configured, every single piece of it – which comes in handy for troubleshooting. With this we have touched a little on the problem of third-party solutions versus implementing our own tailored solutions. How many of you are aware of the NIH problem?
I will argue that there are a number of cases and teams where this problem occurs at Spotify. NIH, in a nutshell, is when you undervalue third-party solutions and convince others to implement your own – in most cases this is a huge problem. The lesson we have learned is that you need to give external tools a try and experiment, but don’t expect something to solve all your problems – preferably define acceptance metrics prior to evaluating the tools.
What is actually very interesting in the data area is a sibling problem of NIH – NeIH – a problem described, I believe, by Michael O. Church. It’s the opposite approach: you overvalue third-party solutions and end up in a messy place of glue-implementation madness. There are loads of great tools in the Big Data area – not all of them work well with each other, and not all of them do well what they are meant to do. I urge you to be critical; sometimes implementing your own tool, or postponing a new, shiny framework in your infrastructure, may be a good thing to do – but it has to be a data-driven decision that brings value. Think about these two problems, and ask yourself: are there examples of such solutions at your company?
To illustrate this I will tell you a real story – a story of great failure, and of success at the end. We had an external consultant at Spotify whose goal was to certify our cluster – basically four days of looking into different corners of our infrastructure. The first two days went really smoothly: we went through our configuration, the state of the cluster and so on, and he could not find an easy way to improve our cluster – which made us feel proud, because, you know, we have this world-class Hadoop expert over and he can’t find a way to improve our cluster, right? But oh boy, was that a big mistake. On day number three we are sitting in a room, the whole team and the consultant, and due to miscommunication and misconfiguration our standby NN and RM go down – which is still fine, because the RM restarts in a minute or two and the standby can start in the background – but unfortunately, during the troubleshooting, we killed our active NN by mistake. At this point basically the whole infrastructure was down – at our scale that means about 2 hours of downtime. It was bad! But wait for day number four: the next day we are sitting in the room, again the whole team and the consultant, but also our managers, and we listen to the consultant saying that our testing and deployment procedures are like the Wild Wild West and that we act like cowboys. It was hard to listen to, but he was right and we knew it. The next thing we did was to go to a room with the team and come up with something to solve this issue. We came up with something that may be obvious – a preproduction cluster: a cluster made from the same machine profile and almost identical configuration that we would use for testing. But how to test was the real question. We went into research mode and started reading and watching presentations – we were especially impressed by a tool called HIT from Yahoo, so we contacted the creator. Unfortunately there was no plan to open source it – but they gave us a nice tip: look at Apache BigTop.
Apache BigTop primarily facilitates building and deploying a Hadoop distribution – but you can also use it in a slightly different way: you can point BigTop at your preproduction cluster and use its smoke tests to test the infrastructure. So our current testing and deployment flow is to first deploy to the preproduction cluster and run the BigTop tests to get instant feedback about the change; if the feedback is fine we deploy to production, and if not, there’s something wrong with the change and we know it before it reaches production. One finding from using BigTop is that it’s actually very easy to extend, so we were able to add smoke tests for our own tools like snakebite and Luigi; and, very importantly, we also run some production workloads as part of the smoke tests – which really makes us feel sure about a change.
In the case of Apache BigTop, our problem was testing the Hadoop infrastructure – even though BigTop is not a perfect fit for this, it provides loads of value out of the box, and thus it’s a great example of avoiding the NIH problem.
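As an illustration of the kind of extra smoke check that can sit next to the standard BigTop suite, here is a minimal sketch using snakebite – the NameNode host, port and test path are placeholders, not our real setup:

```python
# Hypothetical snakebite-based smoke check run against a preproduction cluster.
# The NameNode host/port and the test directory are placeholders.
from snakebite.client import Client


def hdfs_smoke_check(namenode_host="preprod-nn.example.com", namenode_port=8020):
    client = Client(namenode_host, namenode_port, use_trash=False)
    test_dir = "/tmp/smoke-test"

    # Basic mkdir/ls/delete round trip; any HDFS problem surfaces as an exception.
    list(client.mkdir([test_dir], create_parent=True))
    list(client.ls([test_dir]))
    list(client.delete([test_dir], recurse=True))
    return True


if __name__ == "__main__":
    assert hdfs_smoke_check()
    print("HDFS smoke check passed")
```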
As an operator there are many ways to help yourself and also delegate some of the work to the developers themselves. One feature that is disabled by default, but great, is log aggregation – how many of you have log aggregation enabled on your cluster? Cool. In a nutshell, this feature aggregates the YARN container logs from the workers and stores them on HDFS for inspection – very useful for troubleshooting. Most of you probably know how to enable it, right?
It’s dead simple. But there’s one question – how long should we keep the logs for? We thought about it for a while, talked with HW a little bit, and since we have a huge cluster, why not store them for a long time – maybe we will need these logs for some analytics, etc.
This is our initial change to the configuration – 10 years. Does anyone know what bad things can happen if you do that?
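For reference, a minimal sketch of the relevant yarn-site.xml properties – enabling aggregation plus the ten-year retention we initially picked; treat the values as illustrative:

```xml
<!-- yarn-site.xml: enable log aggregation and set retention -->
<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
<property>
  <!-- our original choice: 10 years expressed in seconds -->
  <name>yarn.log-aggregation.retain-seconds</name>
  <value>315360000</value>
</property>
```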
If you run enough jobs, after some time you will see something like this on your NN – and when you see something like this on your NN, you end up with a hellephant!
That’s the situation where your Hadoop cluster spectacularly goes down. What happens is that log aggregation creates many, many files, and a very important consequence of many files on HDFS is a growing heap on the NameNode – until you run out of memory on the NN. The lesson is that it’s a good idea to alert on heap usage for your master daemons, but also to understand your configuration and its consequences: keep yourself up to date on configuration changes, and read the code behind Hadoop configuration keys and how they are connected to each other – not all configuration parameters are documented.
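As an example of the kind of alerting meant here, a minimal sketch that polls the NameNode’s JMX servlet and flags high heap usage – the hostname, port and threshold are assumptions, not our production monitoring:

```python
# Minimal sketch of a NameNode heap check via the HDFS JMX servlet.
# Hostname, port and the alert threshold are assumptions for illustration only.
import requests


def namenode_heap_ratio(nn_host="namenode.example.com", nn_http_port=50070):
    url = "http://%s:%d/jmx?qry=java.lang:type=Memory" % (nn_host, nn_http_port)
    heap = requests.get(url).json()["beans"][0]["HeapMemoryUsage"]
    return float(heap["used"]) / float(heap["max"])


if __name__ == "__main__":
    ratio = namenode_heap_ratio()
    if ratio > 0.85:  # arbitrary example threshold
        print("ALERT: NameNode heap at %d%% - too many small files?" % int(ratio * 100))
```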
While on the topic of log aggregation – it’s good to know that aggregation takes the container log directory and aggregates its contents, so if you put extra information about a task in there, you get it aggregated for free and likely bring value to developers – think profiling and garbage collection logs. Speaking of developers – let’s move on to development and how it works at Spotify.
There are a couple of interesting lessons about productive development. The first, and arguably the most important, is to pick the right tool for the job – ask what the most important value to bring right now is. Let’s talk Spotify: in 2009 Spotify started with Hadoop Streaming as the supported framework for MR development. Hadoop Streaming basically enables you to implement MR jobs in languages other than Java – and for many years it was THE framework, because Spotify loved Python and it enabled us to iterate faster and thus provide knowledge to our business. Time was passing and our Hadoop cluster was growing – at some point we needed something different, something better in terms of performance but also maturity. After a long evaluation – and I encourage you to watch the presentations by David Whiting about the different frameworks – we decided to use Apache Crunch as the supported framework for batch MR. Why? A couple of reasons – first, ease of testing and type safety.
This graph shows the number of successful and failed jobs, split by framework, over 6 months – and these are production jobs. As you can see, the two most popular frameworks are Hadoop Streaming and Crunch, but the difference between failed and successful jobs is crucial. Crunch jobs behave much better and are better tested. Type safety helps discover problems at compile time, and the testing framework that comes with Crunch – which we were able to enhance with the Hadoop minicluster – helps users easily test their jobs; it basically makes testing easy, something we missed with our Hadoop Streaming jobs. But performance is another thing –
On this graph we can see map throughput for Apache Crunch and Hadoop Streaming – there’s a huge difference, and again we are talking about production jobs here. Crunch turns out to be on average 8 times faster, and 75% of all Crunch jobs are much faster than all Hadoop Streaming jobs. What is more interesting is that we actually see higher utilization of our cluster the more Crunch jobs run on it – which makes us super happy.
Another thing that Crunch provides is a great abstraction – and that is another thing a productive developer needs to keep in mind: pick the right abstraction for the job. With Crunch we can start thinking in terms of high-level operations like filter, groupBy and join instead of the old map/reduce legacy. This makes implementation more intuitive and simply pleasant, and thus makes the developer experience much better. The interesting thing we have observed is that a higher abstraction may remove some opportunities for optimization, so it’s not as easy to implement the best-performing job – but on the other hand it reduces the problem of premature optimization and on average performs really well. There are very few people at Spotify who actually know how to optimize pure MR jobs or Hadoop Streaming jobs, but the average optimization we get from Crunch turns out to be really good, as you could see on the performance graph.
We do have loads of nodes – and we have scaling machines nailed down; Crunch scales very well. But there’s a big problem that we currently have: scaling people. How do you scale support and best practices? We constantly see problems with code repetition, HDFS mess, lack of data management and YARN resource contention – all of this brings our productivity down. There’s not enough time to go through all of them, but some of these problems we are trying to tackle with nothing other than our beloved automation. Let’s see some examples:
We automate map split size calculation and thus the number of map tasks, but also the number of reducers and therefore the number and size of output files – all of this is done by estimation from historical data using our workflow manager, Luigi, which I encourage you to take a look at!
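A rough sketch of the idea – not our actual code; the historical-size lookups, the 1 GB-per-reducer target and the target mapper count are illustrative assumptions built on Luigi’s Hadoop job support:

```python
# Rough sketch of split-size and reducer-count estimation in a Luigi Hadoop job.
# estimate_input_bytes(), estimate_output_bytes(), the 1 GB-per-reducer target
# and target_map_tasks are illustrative assumptions, not Spotify's heuristics.
import luigi
import luigi.contrib.hadoop
import luigi.contrib.hdfs


def estimate_input_bytes(task):
    """Placeholder: in reality this would come from historical run statistics."""
    return 2 * 1024 ** 4  # pretend ~2 TB of input


def estimate_output_bytes(task):
    """Placeholder: in reality this would come from historical run statistics."""
    return 64 * 1024 ** 3  # pretend ~64 GB of output


class PlayCountAggregation(luigi.contrib.hadoop.JobTask):
    date = luigi.DateParameter()
    bytes_per_reducer = 1024 ** 3  # target ~1 GB per output file
    target_map_tasks = 500         # illustrative mapper count

    @property
    def n_reduce_tasks(self):
        return max(1, estimate_output_bytes(self) // self.bytes_per_reducer)

    def jobconfs(self):
        confs = super(PlayCountAggregation, self).jobconfs()
        split_size = max(1, estimate_input_bytes(self) // self.target_map_tasks)
        # Pin the split size (and hence the mapper count) by setting both bounds.
        confs.append("mapreduce.input.fileinputformat.split.minsize=%d" % split_size)
        confs.append("mapreduce.input.fileinputformat.split.maxsize=%d" % split_size)
        return confs

    def output(self):
        return luigi.contrib.hdfs.HdfsTarget("/data/playcounts/%s" % self.date)

    def mapper(self, line):
        yield line.split("\t")[0], 1

    def reducer(self, key, values):
        yield key, sum(values)
```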
We are about to finish the second iteration of our HDFS retention policy, with which we automatically remove data and thereby reduce HDFS usage – and, in the long term, hopefully reduce the HDFS legacy mess.
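To make the idea concrete, here is a minimal sketch of such a retention sweep using snakebite – the data root, the 90-day cutoff and the flat directory layout are assumptions, not our actual policy:

```python
# Minimal sketch of an HDFS retention sweep using snakebite.
# DATA_ROOT, RETENTION_DAYS and the flat layout are illustrative assumptions.
import time
from snakebite.client import Client

DATA_ROOT = "/data/tmp-derived"   # hypothetical area covered by the policy
RETENTION_DAYS = 90


def expired_paths(client, root, retention_days=RETENTION_DAYS):
    cutoff_ms = (time.time() - retention_days * 86400) * 1000
    for entry in client.ls([root]):
        # snakebite reports modification_time in milliseconds since the epoch
        if entry["modification_time"] < cutoff_ms:
            yield entry["path"]


if __name__ == "__main__":
    client = Client("namenode.example.com", 8020, use_trash=True)
    doomed = list(expired_paths(client, DATA_ROOT))
    if doomed:
        for result in client.delete(doomed, recurse=True):
            print(result)
```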
Another ongoing effort is the second iteration of automatic user feedback. We already expose a database with aggregated information about all MR jobs, which our users can query to learn how their jobs are performing – but we also plan another, very simple iteration focused on Crunch: right after a workflow pipeline is done, it will provide the user with instant feedback on memory usage, garbage collection and so on – very simple tweaks users can apply to improve their jobs. For example, if a user gives a pipeline 8 GB of memory for each task and, after going through the counters, we see that tasks are actually using at most 3 GB, instant feedback to reduce memory can improve the multitenancy of your cluster and thus improve productivity.
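A toy sketch of that memory check – the history server URL, the counter plumbing and the "used less than half" rule are illustrative assumptions about how such feedback could be computed, not our production code:

```python
# Toy sketch of counter-based memory feedback after a pipeline finishes.
# Host name, job id plumbing and the "less than half" rule are assumptions.
import requests

HISTORY_SERVER = "http://historyserver.example.com:19888"


def average_map_heap_mb(job_id):
    """Average committed heap per map task, derived from job-level MR counters."""
    base = "%s/ws/v1/history/mapreduce/jobs/%s" % (HISTORY_SERVER, job_id)
    job = requests.get(base).json()["job"]
    groups = requests.get(base + "/counters").json()["jobCounters"]["counterGroup"]
    for group in groups:
        if group["counterGroupName"].endswith("TaskCounter"):
            for counter in group["counter"]:
                if counter["name"] == "COMMITTED_HEAP_BYTES":
                    # Job-level counters are summed over tasks, so divide by task count.
                    return counter["mapCounterValue"] / max(job["mapsTotal"], 1) / 2 ** 20
    return None


def memory_feedback(job_id, configured_map_mb):
    used_mb = average_map_heap_mb(job_id)
    if used_mb and used_mb < configured_map_mb / 2:
        print("Job %s asked for %d MB per map task but used ~%d MB - consider "
              "lowering mapreduce.map.memory.mb." % (job_id, configured_map_mb, used_mb))
```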
With that, let’s talk about how organizational structure can affect your productivity – but before that, let’s take a look at this graph.
This graph shows Hadoop availability by quarter at Spotify – higher is of course better. OK, so let’s see what happened here:
In the first part the Hadoop cluster was ownerless; it was best-effort support by a team of people who mostly didn’t even want to do Hadoop operations. As a result, multiple days of downtime happened and the infrastructure was in bad shape, denormalized – overall a terrible state to be in. But there was a light at the end of the tunnel – in Q3 we decided to create a squad: 3 people focused solely on the Hadoop infrastructure.
There was an instant improvement right after the squad was created – users were happy and the infrastructure was getting into shape. One of the first decisions we made was to move to YARN in Q4.
In Q4 and the beginning of ’14 we again saw a drop in availability, mostly due to a huge upgrade and its consequences thereafter. The upgrade itself took a whole weekend, and afterwards we saw many issues and fires that we had to put out; during this time we were mostly reactive, but also working on polishing our Puppet manifests. The whole situation stabilized after most of the fires were gone and Puppet was in good shape.
Our goal is to keep Hadoop at three nines of availability – and we have been getting there since Q2 2014. The Hadoop squad receives constant feedback from users, and it’s common to hear that availability was drastically improved, which improved productivity and the overall experience – which is great and makes us want to work even harder to achieve better results. As you see, there’s a small drop at the beginning of Q1 2015 – does anyone know why, or can guess?
With that, let’s now talk about what surrounds us – the culture. I strongly believe the culture at Spotify has a huge influence on productivity – there are three main pillars of this culture.
Experiment, fail fast and embrace failure. We love to experiment and we make time to experiment, whether it’s a company-wide hack week or R&D days – if you wish to experiment, there’s time to do that, and with loads of curious people at Spotify there’s always something going on. The most successful data-based experiments are Luigi – our Hadoop workflow manager – and snakebite – a pure Python HDFS client – and I encourage you to take a look at them. Fail fast: don’t be afraid to admit failure; keep it as part of the learning process, to the point of embracing it. Talk about your failures and share them publicly, for example through presentations both internally and externally – it will make experimentation, and thus innovation, flow much more smoothly. To back this up with an example, let’s talk about the two most recent ongoing experiments.
Another experiment is Spark – it’s pretty much an ongoing experiment that we come back to every now and then, but officially it’s not welcome on the production cluster due to immaturity and poor multitenancy support. That said, the most recent releases are very promising; we are constantly playing with it and have high hopes for it, especially for the recent dynamic resource allocation feature. There’s not much time left, but I would like to share with you two important lessons from our evaluation of a heavy Spark job.
The first hint is about memory settings – there are two important settings that can improve the stability of your heavy Spark jobs: the memory available for caching (spark.storage.memoryFraction) and the memory available for shuffle (spark.shuffle.memoryFraction). The defaults are 0.6 and 0.2, leaving 0.2 for the runtime. In our case we had a heavy machine-learning job that was doing almost a terabyte of shuffle but very little caching; initially we had issues with the shuffle step, but reducing the storage memory and leaving extra memory for shuffle and the runtime improved stability.
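As a concrete sketch (Spark 1.x property names; the values below are illustrative for a shuffle-heavy, cache-light job, not a general recommendation):

```python
# Spark 1.x memory fractions for a shuffle-heavy job that barely caches anything.
# Defaults are 0.6 (storage) and 0.2 (shuffle); the values here are illustrative.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("shuffle-heavy-ml-job")
        .set("spark.storage.memoryFraction", "0.1")    # little caching needed
        .set("spark.shuffle.memoryFraction", "0.5"))   # give the big shuffle more room

sc = SparkContext(conf=conf)
```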
Another issue we hit was long GC pauses – executors would appear to disappear, which in turn triggers recomputation and, in the end, potentially application failure. After tweaking the executor heartbeat interval and the ack.wait timeout we saw an improvement in stability, and even though GC pauses still occurred, they were less harmful.
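Again as a sketch – the Spark 1.x property names that correspond to these settings; the concrete values are only examples of "more generous than the defaults":

```python
# Spark 1.x settings corresponding to the heartbeat interval and ack wait timeout
# mentioned above: a less trigger-happy heartbeat and a longer ack wait.
from pyspark import SparkConf

conf = (SparkConf()
        .set("spark.executor.heartbeatInterval", "60s")
        .set("spark.core.connection.ack.wait.timeout", "600"))
```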
Ok – let’s get this started with operations.
Ok – if you were asleep the whole time, please wake up now for a few minutes.