I both love and hate Hadoop. I love it because it provides and commodifies an easy-to-use, engineer-friendly, and scalable abstraction layer over a cluster of machines. I hate it because of all the gotchas and the vast knowledge required to be productive across the full Hadoop stack. In this talk I will focus on, and share, the knowledge necessary to be a productive data engineer.
The Evolution of Hadoop at Spotify - Through Failures and Pain - Rafał Wojdyła
The quickest way to learn and evolve infrastructure is by encountering obstacles and being forced to overcome limitations that keep you inches away from project goals. At Spotify, we’ve encountered many of these obstacles and frustrations as we grew our Hadoop cluster from a few machines in an office closet aggregating played song events for financial reports, to our current 900 node cluster that plays a large role in many features that you see in our application today.
Two members of Spotify’s Hadoop ‘squad’ will weave in war stories, failures, frustrations and lessons learned to describe the Hadoop/Big Data architecture at Spotify and talk about how that architecture has evolved.
We’ll talk about how and why we use a number of tools, including Apache Falcon and Apache Bigtop to test changes; Apache Crunch, Scalding and Hive w/ Tez to build features and provide analytics; and Snakebite and Luigi, two in-house tools created to overcome common frustrations.
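Since Snakebite (a pure-Python HDFS client) and Luigi (a workflow engine) are mentioned above, here is a small, hypothetical sketch of how the two are commonly combined. The NameNode address, HDFS paths and aggregation logic are assumptions for illustration, not Spotify's actual pipeline code.

```python
import luigi
from snakebite.client import Client


class CountPlays(luigi.Task):
    """Count played-song log lines for one day of events stored on HDFS."""
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget("plays-%s.txt" % self.date)

    def run(self):
        hdfs = Client("namenode.example.com", 8020)   # assumed NameNode address
        day_dir = "/logs/plays/%s" % self.date        # assumed directory layout
        total = 0
        for entry in hdfs.ls([day_dir]):
            # text() streams (and decompresses) each file's contents
            for chunk in hdfs.text([entry["path"]]):
                total += chunk.count("\n")
        with self.output().open("w") as out:
            out.write("%d\n" % total)


if __name__ == "__main__":
    luigi.run()
```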
How Apache Drives Music Recommendations At Spotify - Josh Baer
The slides go through the high-level process of generating personalized playlists for all Spotify's users, using Apache big data products extensively.
Presentation given at Apache: Big Data Europe conference on September 29th, 2015 in Budapest.
Discover how the world of big data is evolving and becoming faster, more reliable and better organized-- powering many of the cooler new features that you see in the client today!
The document summarizes lessons learned by Spotify about scaling infrastructure and operations. Some key points include: starting with letting experts handle data centers when small, streamlining procurement processes, treating capacity in standardized "pods", focusing infrastructure teams on platforms rather than individual services, implementing automated processes for configuration, provisioning and monitoring, and having individual product teams take on operational responsibilities for their own services with guidance from infrastructure teams. The presentation also covers specific scaling challenges faced with storage, networking, and resilience strategies like retry policies and load shedding.
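As an illustration of one of the resilience strategies named above, here is a generic retry-with-exponential-backoff-and-jitter sketch; the wrapped call and the limits are assumptions, not Spotify's actual policy.

```python
import random
import time


def call_with_retries(fn, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Call fn(); on failure sleep up to base_delay * 2**attempt, then retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))  # full jitter spreads out retry storms


# Example: a flaky call that succeeds on the third attempt.
state = {"calls": 0}

def flaky():
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

print(call_with_retries(flaky))  # -> "ok" after two retried failures
```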
1) At Spotify, big data is used to answer important questions from various stakeholders like how many times songs have been streamed, most popular artists, and streaming numbers for marketing purposes.
2) Data infrastructure at Spotify includes a large Hadoop cluster with over 6 petabytes of data used to generate insights from user activity logs and improve the product.
3) Answering tricky questions requires techniques like A/B testing and analyzing streaming patterns to determine viral songs or artist reactions to new releases. Data-driven decisions are made to personalize the user experience.
This document summarizes Neville Li's work at Spotify developing real-time data streaming applications using Storm. It describes Spotify's large data volumes, how Storm is used to process streaming data at Spotify, details of a social listening topology, and lessons learned around development processes, language choices, and deployment.
Apache Spark: killer or savior of Apache Hadoop? - rhatr
The Big Boss(tm) has just OKed the first Hadoop cluster in the company. You are the guy in charge of analyzing petabytes of your company's valuable data using a combination of custom MapReduce jobs and SQL-on-Hadoop solutions. All of a sudden the web is full of articles telling you that Hadoop is dead, Spark has won and you should quit while you're still ahead. But should you?
The Elephant in the Cloud: A Quest for the Next Generation
In this talk, I will go through the evolution of Hadoop and its ecosystem projects and will try to peer into the crystal ball to predict what may be coming down the pike. I will discuss various ways of crunching data on Hadoop (MapReduce, OpenMPI, Spark and various SQL engines) and how these tools complement each other.
Apache Hadoop is no longer just a faithful, open source, scalable implementation of two seminal papers that came out of Google 10 years ago. It has evolved into a project that provides enterprises with a reliable layer for storing massive amounts of unstructured data (HDFS) while allowing different computational frameworks to leverage those datasets.
The original computational framework (MapReduce) has evolved into a much more scalable set of general-purpose cluster management APIs collectively known as YARN. With YARN underneath, MapReduce is still there to support batch-oriented computations, but it is no longer the only game in town. With OpenMPI, Spark, and Tez rapidly becoming available, now is truly the most exciting time to be a developer in the Hadoop ecosystem. It is also a time when you don't have to be employed by Yahoo!, Facebook or eBay to have access to mind-blowing compute power. That power is a credit card and a pivotal.io account away from anybody on the planet.
I will conclude by outlining some of the ongoing work that makes Hadoop and its ecosystem projects first class citizens in cloud environments based on the work that Pivotal engineers have done with integrating Hadoop into PivotalONE PaaS.
- Hadoop was created to allow processing of large datasets in a distributed, fault-tolerant manner. It was originally developed by Doug Cutting and Mike Cafarella as part of the Nutch project, inspired by Google's papers on GFS and MapReduce and by the growing data and computational needs of web-scale companies.
- The core of Hadoop consists of the Hadoop Distributed File System (HDFS) for storage and Hadoop MapReduce for distributed processing, plus utilities like Hadoop Common for file system access and other basic functionality (a minimal word-count sketch of the MapReduce model follows after this list).
- Hadoop's goals were to process multi-petabyte datasets across commodity hardware in a reliable, flexible and open source way. It assumes failures are expected and handles them to provide fault tolerance.
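To make the MapReduce model above concrete, here is a generic, hedged word-count sketch in the Hadoop Streaming style. It is illustrative only, not taken from the summarized document, and the input/output paths in the comment are assumptions.

```python
#!/usr/bin/env python
# Assumed invocation: hadoop jar hadoop-streaming.jar \
#   -input /data/in -output /data/out \
#   -mapper "python wc_stream.py map" -reducer "python wc_stream.py reduce" \
#   -file wc_stream.py
import sys


def mapper():
    # Emit one (word, 1) pair per token read from stdin.
    for line in sys.stdin:
        for word in line.split():
            print("%s\t1" % word.lower())


def reducer():
    # Input arrives sorted by key, so counts for a word are contiguous.
    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, count))
            current, count = word, 0
        count += int(n)
    if current is not None:
        print("%s\t%d" % (current, count))


if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```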
Danielle Jabin is a data engineer at Spotify who works on A/B testing infrastructure. She describes Spotify's big data landscape, which includes over 40 million active users generating 1.5 TB of compressed data per day. Spotify collects this user data using Kafka for high-volume data collection, processes it using Hadoop on a large cluster, and stores aggregates in databases like PostgreSQL and Cassandra for analytics and visualization.
Slides from a talk at a meetup organized by SF Scala at Spotify's San Francisco office. The slides present details of playlist recommendations at Spotify and how Spotify uses Scalding to develop robust and reliable pipelines to generate these recommendations.
Meetup details: http://www.meetup.com/SF-Scala/events/224430674/
Hadoop Summit 2016 presentation.
As Yahoo continues to grow and diversify its mobile products, its platform team faces an ever-expanding, sophisticated group of analysts and product managers, who demand faster answers: to measure key engagement metrics like Daily Active Users, app installs and user retention; to run longitudinal analysis; and to re-calculate metrics over rolling time windows. All of this, quickly enough not to need a coffee break while waiting for results. The optimal solution for this use-case would have to take into account raw performance, cost, security implications and ease of data management. Could the benefits of Hive, ORC and Tez, coupled with a good data design provide the performance our customers crave? Or would it make sense to use more nascent, off-grid querying systems? This talk will examine the efficacy of using Hive for large-scale mobile analytics. We will quantify Hive performance on a traditional, shared, multi-tenant Hadoop cluster, and compare it with more specialized analytics tools on a single-tenant cluster. We will also highlight which tuning parameters yield maximum benefits, and analyze the surprisingly ineffectual ones. Finally, we will detail several enhancements made by Yahoo's Hive team (in split calculation, stripe elimination and the metadata system) to successfully boost performance.
This document provides an overview of debugging Hive queries with Hadoop in the cloud. It discusses Altiscale's Hadoop as a Service platform and perspective as an operational service provider. It then covers Hadoop 2 architecture, debugging tools, accessing logs in Hadoop 2, the Hive and Hadoop architecture, Hive logs, common Hive issues and case studies on stuck jobs and missing directories. The document aims to help users better understand and troubleshoot Hive queries running on Hadoop clusters.
This is part of an introductory course to Big Data Tools for Artificial Intelligence. These slides introduce students to the use of Apache Pig as an ETL tool over Hadoop.
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of - Charles Givre
Study after study shows that data preparation and other data janitorial work consume 50-90% of most data scientists’ time. Apache Drill is a very promising tool which can help address this. Drill works with many different forms of “self describing data” and allows analysts to run ad-hoc queries in ANSI SQL against that data. Unlike Hive or other SQL-on-Hadoop tools, Drill is not a wrapper for MapReduce and can scale to clusters of up to 10k nodes.
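As a hedged illustration of the ad-hoc ANSI SQL queries described above, the sketch below posts a query to Drill's REST endpoint using the requests library. The host, port, response shape and Drill's bundled employee.json classpath sample are assumptions to verify against your own installation.

```python
import requests

resp = requests.post(
    "http://localhost:8047/query.json",          # assumed Drill web/REST port
    json={"queryType": "SQL",
          "query": "SELECT full_name, salary FROM cp.`employee.json` LIMIT 5"},
)
resp.raise_for_status()

# Drill's REST API returns the result rows as a list of column->value mappings.
for row in resp.json()["rows"]:
    print(row)
```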
The document provides information about Hive and Pig, two frameworks for analyzing large datasets using Hadoop. It compares Hive and Pig, noting that Hive uses a SQL-like language called HiveQL to manipulate data, while Pig uses Pig Latin scripts and operates on data flows. The document also includes code examples demonstrating how to use basic operations in Hive and Pig like loading data, performing word counts, joins, and outer joins on sample datasets.
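For a concrete flavor of the HiveQL side of that comparison, here is a hedged sketch of a word count submitted from Python via PyHive; the connection details and the docs(line STRING) table are assumptions for illustration.

```python
from pyhive import hive

conn = hive.connect(host="localhost", port=10000, username="hive")  # assumed HiveServer2
cursor = conn.cursor()

# Word count over a hypothetical docs(line STRING) table:
# explode each line into words, then group and count.
cursor.execute("""
    SELECT word, COUNT(*) AS cnt
    FROM docs
    LATERAL VIEW explode(split(line, ' ')) t AS word
    GROUP BY word
    ORDER BY cnt DESC
    LIMIT 20
""")

for word, cnt in cursor.fetchall():
    print(word, cnt)
```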
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign... - Yahoo Developer Network
Spark and Ignite are two of the most popular open source projects in the area of high-performance Big Data and Fast Data. But did you know that one of the best ways to boost performance for your next generation real-time applications is to use them together? In this session, Dmitriy Setrakyan, Apache Ignite Project Management Committee Chairman and co-founder and CPO at GridGain will explain in detail how IgniteRDD — an implementation of native Spark RDD and DataFrame APIs — shares the state of the RDD across other Spark jobs, applications and workers. Dmitriy will also demonstrate how IgniteRDD, with its advanced in-memory indexing capabilities, allows execution of SQL queries many times faster than native Spark RDDs or Data Frames. Don't miss this opportunity to learn from one of the experts how to use Spark and Ignite better together in your projects.
Speakers:
Dmitriy Setrakyan is a founder and CPO at GridGain Systems. Dmitriy has been working with distributed architectures for over 15 years and has expertise in the development of various middleware platforms, financial trading systems, CRM applications and similar systems. Prior to GridGain, Dmitriy worked at eBay, where he was responsible for the architecture of an ad-serving system processing several billion hits a day. Currently, Dmitriy also acts as PMC chair of the Apache Ignite project.
From Oracle to Hadoop with Sqoop and other tools - Guy Harrison
This document discusses tools for transferring data between relational databases and Hadoop, focusing on Apache Sqoop. It describes how Sqoop was optimized for Oracle imports and exports, reducing database load by up to 99% and improving performance by 5-20x. It also outlines the goals of Sqoop 2 to improve usability, security, and extensibility through a REST API and by separating responsibilities.
This document discusses Hive Editor in Hue, an open source web interface for Hadoop that includes applications for Hive, Pig, Impala, Oozie, Solr, Sqoop, and HBase. The Hive Editor in Hue provides syntax highlighting, query autocomplete, live progress and logs for Hive queries and MapReduce jobs, the ability to work with multiple databases and statements, and features for saving, exporting, and sharing queries. It connects to HiveServer2 and supports Sentry for authorization. A demo of the Hive Editor and Metastore Browser is provided.
NYC HUG - Application Architectures with Apache Hadoop - markgrover
This document summarizes Mark Grover's presentation on application architectures with Apache Hadoop. It discusses processing clickstream data from web logs using techniques like deduplication, filtering, and sessionization in Hadoop. Specifically, it describes how to implement sessionization in MapReduce by using the user's IP address and timestamp to group log lines into sessions in the reducer.
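A hedged sketch of the sessionization idea described above: given log lines already grouped by IP (the map key) and sorted by timestamp, the reducer starts a new session whenever the gap between hits exceeds a timeout. The 30-minute timeout and the input format are assumptions, not the presenter's exact code.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)


def sessionize(events):
    """events: iterable of (ip, timestamp) for ONE ip, sorted by timestamp."""
    sessions, current = [], []
    last_ts = None
    for ip, ts in events:
        # A long silence ends the current session and starts a new one.
        if last_ts is not None and ts - last_ts > SESSION_TIMEOUT:
            sessions.append(current)
            current = []
        current.append((ip, ts))
        last_ts = ts
    if current:
        sessions.append(current)
    return sessions


if __name__ == "__main__":
    t = lambda s: datetime.strptime(s, "%Y-%m-%d %H:%M:%S")
    hits = [("10.0.0.1", t("2016-01-01 10:00:00")),
            ("10.0.0.1", t("2016-01-01 10:05:00")),
            ("10.0.0.1", t("2016-01-01 11:30:00"))]
    print([len(s) for s in sessionize(hits)])  # -> [2, 1]
```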
IPython Notebook as a Unified Data Science Interface for Hadoop - DataWorks Summit
This document discusses using IPython Notebook as a unified data science interface for Hadoop. It proposes that a unified environment needs: 1) mixed local and distributed processing via Apache Spark, 2) access to languages like Python via PySpark, 3) seamless SQL integration via SparkSQL, and 4) visualization and reporting via IPython Notebook. The document demonstrates this environment by exploring open payments data between doctors/hospitals and manufacturers.
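A minimal sketch of the "unified environment" pieces listed above, as you might run them from a notebook cell: PySpark for distributed processing and SparkSQL for SQL over the same data. It is written against the Spark 2.x SparkSession API, and the payments file path and column names are assumptions, not the actual open payments schema.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("open-payments-sketch").getOrCreate()

# 1) Distributed processing via PySpark DataFrames.
payments = spark.read.csv("open_payments.csv", header=True, inferSchema=True)

# 2) Seamless SQL via SparkSQL over the same data.
payments.createOrReplaceTempView("payments")
top = spark.sql("""
    SELECT recipient_state, SUM(total_amount) AS total
    FROM payments
    GROUP BY recipient_state
    ORDER BY total DESC
    LIMIT 10
""")

# 3) Bring the small result set back locally for plotting in the notebook.
print(top.toPandas())
```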
This document provides guidance on sizing and configuring Apache Hadoop clusters. It recommends separating master nodes, which run processes like the NameNode and JobTracker, from slave nodes, which run DataNodes, TaskTrackers and RegionServers. For medium to large clusters it suggests 4 master nodes and the remaining nodes as slaves. The document outlines factors to consider for optimizing performance and cost like selecting balanced CPU, memory and disk configurations and using a "shared nothing" architecture with 1GbE or 10GbE networking. Redundancy is more important for master than slave nodes.
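The sizing advice above boils down to simple arithmetic. The sketch below shows a back-of-the-envelope worker-node count; every number in it (disk per node, replication factor, overhead, usable fraction, data volume) is an assumption to replace with your own.

```python
import math

raw_data_tb = 500          # data you need to store
replication = 3            # HDFS replication factor
overhead = 0.25            # scratch space for shuffle/intermediate output
disk_per_node_tb = 12 * 3  # e.g. 12 drives x 3 TB, JBOD "shared nothing"
usable_fraction = 0.70     # leave room for OS, logs, and non-HDFS use

needed_tb = raw_data_tb * replication * (1 + overhead)
slave_nodes = math.ceil(needed_tb / (disk_per_node_tb * usable_fraction))
print("worker (slave) nodes needed:", slave_nodes)  # masters are sized separately
```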
A glimpse of test automation in the Hadoop ecosystem by Deepika Achary - QA or the Highway
This document discusses test automation in the Hadoop ecosystem. It provides an overview of key components like HDFS, HBase, Kafka, and Solr. It then describes how to set up test automation for each component using Java libraries and classes. Automating tests provides advantages like creating a test framework, enabling gray box testing, running tests easily in batch mode, ensuring flexibility of test data, quickly finding bugs, and maintaining health of systems. The presentation concludes with key learnings around Big Data, Hadoop components, and how to approach automation.
The document provides an overview of Hadoop, an open-source software framework for distributed storage and processing of large datasets. It describes how Hadoop uses HDFS for distributed file storage across clusters and MapReduce for parallel processing of data. Key components of Hadoop include HDFS for storage, YARN for resource management, and MapReduce for distributed computing. The document also discusses some popular Hadoop distributions and real-world uses of Hadoop by companies.
Data Engineering with Spring, Hadoop and Hive - Alex Silva
This presentation will outline the evolution of the monitoring data platform pipeline at Rackspace and explore the compute and data management challenges we have faced at this scale. We will focus on our use of Hadoop and Hive as data storage and transformation platforms while discussing the technology stack, key architectural decisions, observations and pitfalls encountered in building the pipeline.
Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. It presents a SQL-like interface for querying data stored in various databases and file systems that integrate with Hadoop. The document provides links to Hive documentation, tutorials, presentations and other resources for learning about and using Hive. It also includes a table describing common Hive CLI commands and their usage.
This document provides an overview of the Sqoop tool, which is used to transfer data between Hadoop and relational database servers. Sqoop can import data from databases into HDFS and export data from HDFS to databases. The document describes how Sqoop works, provides installation instructions, and outlines various Sqoop commands for import, export, jobs, code generation, and interacting with databases.
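To ground the import workflow described above, here is a hedged sketch that shells out to the sqoop CLI from Python; the JDBC URL, credentials file, table and target directory are placeholders, not values from the document.

```python
import subprocess

# Import the `orders` table from MySQL into HDFS with four parallel mappers.
subprocess.run(
    [
        "sqoop", "import",
        "--connect", "jdbc:mysql://db.example.com/shop",
        "--username", "etl",
        "--password-file", "/user/etl/.db_password",  # keeps the password off the command line
        "--table", "orders",
        "--target-dir", "/data/raw/orders",
        "--num-mappers", "4",
    ],
    check=True,  # raise if sqoop exits non-zero
)
```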
2016 Data Science Salary Survey - O’Reilly Data Science - Adam Rabinovitch
In this fourth edition of the O’Reilly Data Science Salary Survey, O’Reilly analyzed input from 983 respondents working in the data space, across a variety of industries, representing 45 countries and 45 US states. Through the results of the 64-question survey, they explored which tools data scientists, analysts, and engineers use, which tasks they engage in, and, of course, how much they make.
Key findings include:
• Python and Spark are among the tools that contribute most to salary.
• Among those who code, the highest earners are the ones who code the most.
• SQL, Excel, R and Python are the most commonly used tools.
• Those who attend more meetings earn more.
• Women make less than men for doing the same thing.
• Country and US state GDP serves as a decent proxy for geographic salary variation (not as a direct estimate, but as an additional input for a model).
• The most salient division in tool and task usage is between those who mostly use Excel, SQL, and a small number of closed-source tools, and those who use more open-source tools and spend more time coding.
• R is used across this division: even people who don’t code much or use many open-source tools use R.
• A secondary division emerges among the coding half, separating a younger, Python-heavy data scientist/analyst group from a more experienced data scientist/engineer cohort that tends to use a high number of tools and earns the highest salaries.
Bridging the Gap Between Data Science & Engineer: Building High-Performance T... - ryanorban
Data scientists, data engineers, and data businesspeople are critical to leveraging data in any organization. A common complaint from data science managers is that data scientists invest time prototyping algorithms, and throw them over a proverbial fence to engineers to implement, only to find the algorithms must be rebuilt from scratch to scale. This is a symptom of a broader ailment -- that data teams are often designed as functional silos without proper communication and planning.
This talk outlines a framework to build and organize a data team that produces better results, minimizes wasted effort among team members, and ships great data products.
10 more lessons learned from building Machine Learning systems - Xavier Amatriain
1. Machine learning applications at Quora include answer ranking, feed ranking, topic recommendations, user recommendations, and more. A variety of models are used including logistic regression, gradient boosted decision trees, neural networks, and matrix factorization.
2. Implicit signals like watching and clicking tend to be more useful than explicit signals like ratings. However, both implicit and explicit signals combined can better represent long-term goals.
3. The outputs of machine learning models will often become inputs to other models, so models need to be designed with this in mind to avoid issues like feedback loops.
From Digital Analytics to Insights: Data-Driven Decision Making & Changes in Consumer Trends to Effectively Develop Below-the-Line Campaigns / Guest speaking on Nov 25, 2015 at Asia Business Connect's conference on "Effective Below-the-Line Marketing Strategies"
This talk given at the Hadoop Summit in San Jose on June 28, 2016, analyzes a few major trends in Big Data analytics.
These are a few takeaways from this talk:
- Adopt Apache Beam for easier development and portability between Big Data Execution Engines.
- Adopt stream analytics for faster time to insight, competitive advantages and operational efficiency.
- Accelerate your Big Data applications with In-Memory open source tools.
- Adopt Rapid Application Development of Big Data applications: APIs, Notebooks, GUIs, Microservices…
- Have Machine Learning part of your strategy or passively watch your industry completely transformed!
- How to advance your strategy for hybrid integration between cloud and on-premise deployments?
Data Visualization 101: How to Design Charts and Graphs - Visage
Learn to design effective charts and graphs.
Your data is only as good as your ability to understand and communicate it. The right visualization is essential to incite a desired action, whether from customers or colleagues. But most marketers aren’t mathematicians or adept at data visualization. Fortunately, you don’t need a PhD in statistics to crack the data visualization code.
QCon Rio - Machine Learning for Everyone - Dhiana Deva
Supercomputers and teams of MIT PhDs are no longer required to build data-driven predictive models. We are witnessing innovations in machine learning that are making this field more and more accessible.
This talk aims to demystify machine learning by presenting its concepts and a range of technologies in use.
It covers the types of problems in this area (classification, regression, clustering, dimensionality reduction, etc.), their stages (normalization, training, optimization, regularization, etc.) and their algorithms, from linear regression and k-means through decision trees to neural networks, always applied to real problems.
The talk also introduces tools such as Scikit-learn, Pandas, R, MATLAB and Amazon Machine Learning, along with a way to practice and experiment with these ideas through competitions like Kaggle.
This document provides instructions on how to draw simple electrical circuit diagrams and install a basic electrical circuit. It explains the basic electrical symbols and the types of diagrams. It then describes the steps for receiving the necessary materials and installing a circuit, including checking its operation and installing meters.
The document describes the steps to create a blog, including selecting a template, adding 5 posts, and publishing the first comment while waiting for more comments from readers.
Why You Should Team Up and Make Friends: Your Professional Responsibilities W... - Parsons Behle & Latimer
A presentation about the ethical and professional obligations when reviewing a potential personal injury matter and when associating with another firm on personal injury matters.
The presentation title and the presenter's contact information are provided. No other details about the content or purpose of the presentation are given in the brief document. It appears to be an introductory slide with only basic identifying information listed.
Final teaching listening june 2012 with recording - Just_Peachy44
The document discusses selective listening and provides tips for practicing it. Selective listening means focusing only on the specific information you need rather than trying to understand everything that is said. It can be used when learning costs, business hours, or processes. The tips include planning what information you need beforehand, listening multiple times if needed, taking notes, and asking the speaker to repeat important points. An example is provided of selectively listening to a movie theater recording to find the price of matinee tickets.
The document proposes a simpler and smarter user interface for phone applications called "Phone Pad" that uses drag and drop gestures instead of multiple buttons. Phone Pad would allow users to change their status by dragging a phone icon to different areas of the interface. Users could also answer calls, make calls, conference calls, transfer calls, and log out by interacting with the phone icon. The interface is intended to make phone applications easier to use compared to traditional button-based interfaces. The proposal is being prepared for patent application.
This document summarizes an analysis of global oil production capacity to the year 2020. Some key points:
1) Additional unrestricted global oil production of over 49 million barrels per day is targeted by 2020, but after adjusting for risks, the potential increase is estimated to be around 29 million barrels per day.
2) Factoring in depletion rates and reserve growth, the estimated net increase in global oil production capacity by 2020 is around 17.6 million barrels per day, bringing total capacity to around 110.6 million barrels per day.
3) The largest estimated increases in production capacity by 2020 come from Iraq, the United States, Canada, and Brazil. The U.S. increase is particularly significant due
Move out from AppEngine, and Python PaaS alternatives - tzang ms
This document discusses moving a podcast hosting application called MyAudioCast off of Google App Engine (GAE) and onto other Python platforms as a result of high costs and limitations. Some key points:
- MyAudioCast was running on GAE for over a year but costs were rising to $120/month due to high storage, bandwidth, and processing usage.
- Performance on GAE was poor with high error rates for operations like inserting logs and updating counters.
- Development was slowed by GAE limitations like long deployment times and inability to easily use common Python packages.
- The author chose to migrate MyAudioCast to the Linode VPS and Heroku PaaS for better pricing,
Spotify in the Cloud - An evolution of data infrastructure - Strata NYC - Josh Baer
Slides from a presentation given by Alison Gilles and Josh Baer during StrataNYC 2017.
Covers the decision, challenge and strategy (technical, organizational, people) for converting Spotify's 2500 node Hadoop cluster's worth of data and processing to Google Cloud.
Finally, touches on Spotify's resulting infrastructure on GCP.
This document discusses using ClickHouse for experimentation and metrics at Spotify. It describes how Spotify built an experimentation platform using ClickHouse to provide teams interactive queries on granular metrics data with low latency. Key aspects include ingesting data from Google Cloud Storage to ClickHouse daily, defining metrics through a centralized catalog, and visualizing metrics and running queries using Superset connected to ClickHouse. The platform aims to reduce load on notebooks and BigQuery by serving common queries directly from ClickHouse.
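As a hedged illustration of the kind of interactive metric query such a platform serves, the sketch below uses the clickhouse-driver Python package; the host and the experiment_metrics table and columns are assumptions, not Spotify's actual schema.

```python
from clickhouse_driver import Client

client = Client(host="clickhouse.internal", port=9000)  # assumed native-protocol endpoint

# Daily event counts per experiment over the last week.
rows = client.execute(
    """
    SELECT experiment_id, toDate(event_time) AS day, count() AS events
    FROM experiment_metrics
    WHERE event_time >= now() - INTERVAL 7 DAY
    GROUP BY experiment_id, day
    ORDER BY experiment_id, day
    """
)
for experiment_id, day, events in rows:
    print(experiment_id, day, events)
```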
This document summarizes and compares several open source monitoring tools: Nagios, Graphite, StatsD, Logstash, and Sensu. Nagios is introduced as a commonly used tool that some love and some find frustrating. Graphite is described as a tool for storing and graphing time-series data. StatsD aggregates counters and timers and sends them to backend services like Graphite. Logstash is an tool for managing logs and events that can input, filter, and output data. Sensu is a monitoring router that connects check scripts to handler scripts to alert or process monitoring data. Examples are given for each tool and what types of metrics to collect.
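To make the StatsD role above concrete, here is a minimal sketch of its plain-text UDP wire protocol for counters and timers; the daemon address and metric names are assumptions.

```python
import socket
import time

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
STATSD = ("statsd.example.com", 8125)  # StatsD's conventional UDP port

# Counter: "<metric>:<value>|c"
sock.sendto(b"app.logins:1|c", STATSD)

# Timer: "<metric>:<milliseconds>|ms"
start = time.time()
# ... do the work being measured ...
elapsed_ms = int((time.time() - start) * 1000)
sock.sendto(("app.login_latency:%d|ms" % elapsed_ms).encode(), STATSD)
```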
Autodesk has built a large self-service big data pipeline to process large amounts of data from their various products and services on a daily basis. The pipeline ingests raw data, indexes it, aggregates and summarizes it over time, and makes it available to business users through various reporting and analytics tools. It processes over 2 billion transactions per day from many different data sources totaling over 800 terabytes of data.
Apache Tajo: A Big Data Warehouse System on Hadoop
Presented by Jae-hwa Jeong, Apache Tajo committer and senior research engineer at Gruter, at Bigdata World Convention 2014 on Oct. 23 in Busan, Korea
Data-Driven Development Era and Its Technologies - SATOSHI TAGOMORI
This document discusses data-driven development and the technologies used in the data analytics process. It covers topics like data collection, storage, processing, and visualization. The document advocates using managed cloud services for data and analytics to focus on data instead of managing infrastructure. Choosing technologies should be based on the type of data and problems to solve, not the other way around. Services like Google BigQuery, Amazon Redshift, and Treasure Data are recommended for their ease of use.
How LinkedIn Democratizes Big Data Visualization - Chi-Yi Kuan
Speakers: Jonathan Wu (LinkedIn), Praveen Neppalli Naga (LinkedIn), Chi-Yi Kuan (LinkedIn)
Category: Hadoop in Action
LinkedIn processes enormous numbers of events each day. This data is of critical importance for data analysts, engineers, business experts, and data scientists who seek a deep understanding of the interactions within LinkedIn’s professional social graph. They use this data to derive insights and performance metrics, which lead to better business decisions on products, marketing, sales, and other functional areas. Areas of interest include Email, Growth, Engagement, and Trending metrics. Development of internal tools has traditionally been based on specific needs, optimized for the business use case, and non-interoperable. The engineering challenge is to allow business users to easily access and organize huge amounts of data in a comprehensive way, and to get to the insights they need quickly and flexibly through graphs and charts. The data needs to be sufficiently granular to work for different needs, the interface needs to be intuitive and simple, and the infrastructure needs to be high-performance, allowing users to manipulate large amounts of data quickly.
The solution to this challenge was realized by the LinkedIn Business Analytics and Data Analytics Infrastructure teams utilizing an integrated stack that includes an interactive analytics infrastructure and a self-serve data visualization front-end solution. The user interface provides a customizable ability to build charts, tables, and queries to suit highly customized reporting needs on any device. The back-end infrastructure is based on Hadoop, which leverages LinkedIn’s investment in highly scalable, data-rich systems. The combined solution brings the ability to visualize, slice, dice, and drill through billions of records and hundreds of dimensions quickly and at scale.
In this talk, you will learn the background of the data challenges that LinkedIn faced, how the teams came together to construct the solution, and the underlying stack structure powering this solution.
The document discusses strategies for scaling real-time applications to support 1 million concurrent users on the JVM. It recommends using microservices and embracing polyglot programming. It also provides examples of building blocks for distributed systems including consistent hashing, bloom filters, throttling with leaky bucket algorithms, and using Kafka for asynchronous data processing pipelines.
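Of the building blocks listed above, consistent hashing is easy to show compactly. The sketch below is a generic ring with virtual nodes, not the talk's implementation: keys and nodes are hashed onto a ring, each key goes to the first node clockwise from it, and adding or removing a node only remaps a small share of keys.

```python
import bisect
import hashlib


def _hash(value):
    return int(hashlib.md5(value.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes, vnodes=100):
        # Virtual nodes smooth out the key distribution across physical nodes.
        self._ring = sorted(
            (_hash("%s#%d" % (node, i)), node)
            for node in nodes for i in range(vnodes)
        )
        self._points = [p for p, _ in self._ring]

    def node_for(self, key):
        # First ring point clockwise from the key's hash (wrapping around).
        idx = bisect.bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["cache-1", "cache-2", "cache-3"])
print(ring.node_for("user:42"))  # the same key always maps to the same node
```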
1) SAP provides database, analytics, mobile, and cloud software and services including SAP HANA, Sybase IQ, SQL Anywhere, and Sybase ASE databases.
2) SAP works with customers in various industries like automotive, sports, and telecommunications to develop real-time analytics solutions using SAP HANA.
3) SAP continues to invest in research to advance its software and database technologies.
Spark Magic: Building and Deploying a High Scale Product in 4 Months - tsliwowicz
This document summarizes Taboola's use of Spark to build their Newsroom product, a real-time analytics tool for content sites, in 4 months. Key points include: Taboola deployed Newsroom on a large Spark and Cassandra cluster to process 5TB of daily data and provide real-time recommendations, testing, and analytics. Newsroom aggregates data into batches and replays processing to ensure accurate counts. The system faced challenges around performance optimizations, debugging, and issues like keys being dependent on JVM state. Spark helped Taboola successfully deliver Newsroom and supports other uses like automatic campaign management.
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase - DataWorks Summit
As one of the few closed-loop payment platforms, PayPal is uniquely positioned to provide merchants with insights aimed to identify opportunities to help grow and manage their business. PayPal processes billions of data events every day around our users, risk, payments, web behavior and identity. We are motivated to use this data to enable solutions to help our merchants maximize the number of successful transactions (checkout-conversion), better understand who their customers are and find additional opportunities to grow and attract new customers.
As part of the Merchant Data Analytics, we have built a platform that serves low latency, scalable analytics and insights by leveraging some of the established and emerging platforms to best realize returns on the many business objectives at PayPal.
Join us to learn more about how we leveraged platforms and technologies like Spark, Hive, Druid, Elastic Search and HBase to process large scale data for enabling impactful merchant solutions. We’ll share the architecture of our data pipelines, some real dashboards and the challenges involved.
Speakers
Kasiviswanathan Natarajan, Member of Technical Staff, PayPal
Deepika Khera, Senior Manager - Merchant Data Analytics, PayPal
- Data observability is important for Spotify because they process massive amounts of data from 8 million events per second.
- To ensure observability, Spotify annotates and documents their data schemas, monitors pipeline execution times and counts to check for errors, monitors financial costs of pipelines and storage, and sets up alerts and dashboards to monitor for failures (a minimal sketch of such a count check follows after this list).
- Having good data observability helps Spotify understand where their data is coming from and going, troubleshoot issues quickly, and ensure royalty payments to artists are accurate since they rely on the data pipelines.
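As a hedged illustration of the count monitoring mentioned above, here is a minimal day-over-day record-count check; the threshold, the count source and the alerting hook are assumptions, not Spotify's actual system.

```python
def check_daily_counts(today_count, yesterday_count, max_drop=0.2):
    """Return (ok, message); flag an alert if today's volume fell more than max_drop."""
    if yesterday_count == 0:
        return False, "no baseline: yesterday had zero records"
    drop = (yesterday_count - today_count) / yesterday_count
    if drop > max_drop:
        return False, "record count dropped %.0f%% vs yesterday" % (drop * 100)
    return True, "ok"


ok, msg = check_daily_counts(today_count=6_100_000, yesterday_count=8_000_000)
if not ok:
    print("ALERT:", msg)  # in practice this would page someone or post to a dashboard
```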
The document discusses AlpineNow, a company that provides advanced analytics solutions for big data. It describes how AlpineNow allows for code-free, visual and collaborative analytics that reduce the time to insights from weeks/months to hours/days. Key features highlighted include automated data collection, self-serve visual exploration and analysis of entire datasets, and multi-user collaboration on models and projects.
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2l2Rr6L.
Doug Daniels discusses the cloud-based platform they have built at DataDog and how it differs from a traditional datacenter-based analytics stack. He walks through the decisions they have made at each layer, covers the pros and cons of these decisions and discusses the tooling they have built. Filmed at qconsf.com.
Doug Daniels is a Director of Engineering at Datadog, where he works on high-scale data systems for monitoring, data science, and analytics. Prior to joining Datadog, he was CTO at Mortar Data and an architect and developer at Wireless Generation, where he designed data systems to serve more than 4 million students in 49 states.
This talk evaluates some easy ways to extract useful trending and capacity planning out of your existing monitoring investment. Using Nagios performance data, we examine simple behaviors with PNP4Nagios and graduate on to more insightful analytics with Graphite. With metrics in hand, we look at the questions that IT /should/ be asking, such as:
* What sort of data should I trend?
* Why do I need to trend it?
* How do Operational or Engineering trends relate to Business or Transactional monitoring?
* How does this data impact our customer relationship and/or their bottom-line?
Finally, we look at creative ways to get profiling data out of your production systems with a minimum amount of effort from your development team.
TIBCO provides an analytics platform that delivers business value across the analytics spectrum from descriptive to predictive to prescriptive analytics. The platform includes Spotfire for visual analytics, predictive analytics using R scripting, and real-time event processing capabilities. It can consume and analyze various data sources including big data. The platform enables different types of users from data scientists to analysts to business users.
How Tencent Applies Apache Pulsar to Apache InLong - Pulsar Summit Asia 2021 - StreamNative
1) Apache InLong is an open source data integration framework that provides automatic, secure, and reliable data transmission. It supports both batch and stream processing using different message queues like Apache Pulsar.
2) Apache Pulsar is used with Apache InLong because it offers very low latency, high throughput, reliable data transmission, and multi-tenancy. KoP allows migrating Kafka workloads to Pulsar.
3) Apache InLong contributes to Apache Pulsar through over 60 contributors and 50 pull requests to the KoP project. It uses Pulsar for auto disaster tolerance, multi-tenancy of data streams, and auditing data streams.
The Ipsos AI Monitor 2024 Report - Social Samosa
According to Ipsos AI Monitor's 2024 report, 65% of Indians said that products and services using AI have profoundly changed their daily lives in the past 3-5 years.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad and Procure.FYI's Co-Found
Learn SQL from basic queries to advanced queries - manishkhaire30
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Aggregage
This webinar will explore cutting-edge, less familiar but powerful experimentation methodologies which address well-known limitations of standard A/B Testing. Designed for data and product leaders, this session aims to inspire the embrace of innovative approaches and provide insights into the frontiers of experimentation!
Global Situational Awareness of A.I. and where it's headedvikram sood
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be unleashed, and before long, The Project will be on. If we’re lucky, we’ll be in an all-out race with the CCP; if we’re unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
The Building Blocks of QuestDB, a Time Series Databasejavier ramirez
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review a history of some of the changes we have gone through over the past two years to deal with late and unordered data, non-blocking writes, read replicas, and faster batch ingestion.
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...sameer shah
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
3. What is Spotify?
For everyone:
• Streaming Service
• Launched in October 2008
• 60 Million Monthly Users
• 15 Million Paid Subscribers
+ and for me:
• 1.3K nodes Hadoop cluster
Hi – my name is Rafal and I’m an engineer at Spotify. In this presentation I will talk about how to be a productive data engineer. I will combine the knowledge of multiple productive engineers at Spotify and touch on different areas of your daily work life. I will use real-world examples, failures and success stories – but mostly failures. So if you are, or want to be, a data engineer, hopefully after this presentation every single one of you will take at least one ‘ahh’ moment with you – the moment when you learn something new. I hope this learning will improve your productivity, bring a new feature to your infrastructure, or maybe spark a discussion inside your team.
We will go through the lessons of a productive data engineer and cover four different areas – operations, development, organization and culture. We will work our way up from low-level admin tips and spectacular disasters; after hardcore operations we will talk about development on Hadoop – what to avoid, and how Spotify is overcoming the huge problem of legacy Hadoop tools. After the development part we will take a look at how organizational structure can affect your productivity and how one can tackle that problem, and finish with the ubiquitous culture – how culture can help you be productive. So in a way we start with the low-level scope of your productivity – how the cluster itself can affect you and how operators can help you out – to later talk about your development decisions, the structure of the company, and finally how your environment can influence your work. There will be time for questions at the end, so please keep them until then.
Before we go deep into the presentation, let’s first talk about what Spotify is. Spotify is a streaming service, launched in 2008 in beautiful Stockholm, Sweden. The current public numbers are 60M monthly users and 15M subscribers. What’s unique about the Spotify service is that it can play a perfect song for every single moment, and some of this is powered through Hadoop – which makes it even cooler!
For me, Spotify is also a 1.3K-node Hadoop cluster – which is like a baby for a team of 4 people. A baby that is sometimes very frustrating; shit happens all the time and you have to wake up in the middle of the night and clean it up, but it’s our baby and we love it. Without further ado, let’s move to the core of the topic and start with operations. If there’s one lesson that comes from operating Hadoop clusters from a handful of nodes in the corner of the office to 1300 nodes – it is AUTOMATION.
Automation is crucial – especially when talking about Hadoop. Hadoop is a huge beast to manage: there are loads of moving parts, loads of new stuff coming in, and there’s always some reason for a Hadoop cluster to go down. As if Hadoop were not enough, there’s always something extra that your company will push on the poor operators – whether it’s a new Linux distro, or a bug in libc that means you need to restart all the daemons, and so on and so on.
You want to be proactive and do as little manual work as possible – without automation even coffee won’t help you. You want to be like Adam – happy, working on new features, enhancing Hadoop and bringing joy to Hadoop users. You don’t want to be the poor operator on the left, focused primarily on putting out fires, exhausted. By the way, this is a picture of Adam and me after 40 hours of Hadoop upgrade from Hadoop 1 to 2 in 2013.
So how do you reach good-enough automation of your cluster? Let’s take Spotify as an example. Spotify started with Hadoop in 2009, very early; then there were a couple of tiny expansions and a short episode of Hadoop on EMR, and we went back to on-premise with a shiny new 60 nodes. At that point we had to decide how to manage Hadoop – and because back then Cloudera Manager (CM) was limited, Ambari didn’t exist, and Spotify loves Puppet, we decided to use Puppet for this use case. It was a rather big effort and took time, during which we had to drop some work, put out fires and work on Puppet, but it was a great investment. Today, after a few iterations, we like our Puppet. As an example, the most recent ongoing expansion is rather easy – we name the machines using the proper naming convention and Puppet kicks in, installs all the services and configuration, and keeps the machines in a normalized state – a very, very important piece of our infrastructure. But wait – the slide says something about Ambari and CM. Yes – because if we were to set up a cluster today, we would most likely evaluate at least these two solutions. Like I said, Spotify basically didn’t have a choice, we settled on Puppet, and we are happy about it right now, but there’s huge leverage you can gain from using these tools – loads of features that you get out of the box that we had to implement ourselves. So if you are considering building a Hadoop cluster, make sure to give these tools a good try. They may not solve all your issues and use cases, but they will surely bring loads of value, and over time you will get even more features just from the community – which is great, and is something that we are missing.
That said – even if you decide to use Ambari or CM, most likely you will still need some kind of configuration management tool, whether it’s Puppet, Chef, Salt or whatever your favorite is. You will need one: there will always be some extra library to install and configure, some user to create, LDAP configuration and so on. There’s another interesting outcome of building our own Puppet infrastructure – we know exactly how our Hadoop is configured, every single piece of it – which comes in handy for troubleshooting. With this we have touched a little on the problem of third-party solutions versus implementing our own tailored solutions. How many of you are aware of the NIH problem?
I will argue that there are a number of cases and teams where this problem occurs at Spotify. NIH, in a nutshell, is when you undervalue third-party solutions and convince others to implement your own – in most cases this is a huge problem. The lesson we have learned is that you need to give external tools a try and experiment, but don’t expect something to solve all your problems – preferably define acceptance metrics prior to evaluating the tools.
What is actually very interesting in the data area is a sibling problem of NIH – NeIH – a problem described, I believe, by Michael O. Church. It’s the opposite approach: you overvalue third-party solutions and end up in a messy place of glue-implementation madness. There are loads of great tools in the Big Data area – not all of them work well with each other, and not all of them do well what they are meant to do. I urge you to be critical; sometimes implementing your own tool, or postponing a new, shiny framework in your infrastructure, may be a good thing to do – but it has to be a data-driven decision that brings value. Think about these two problems, and ask yourself: are there examples of such solutions at your company?
To illustrate this I will tell you a real story – a story of great failure, and of success at the end. We had an external consultant at Spotify whose goal was to certify our cluster – basically four days of looking into different corners of our infrastructure. The first two days went really smoothly: we went through our configuration, the state of the cluster and so on, and he could not find an easy way to improve our cluster – which made us feel proud, because, you know, we have this world-class Hadoop expert over and he can’t find a way to improve our cluster, right? But oh boy, was that a big mistake. On day number three we are sitting in a room, the whole team and the consultant, and due to miscommunication and misconfiguration our standby NN and RM go down – which is still fine, because the RM restarts in a minute or two and the standby can start in the background – but unfortunately, during the troubleshooting, we killed our active NN by mistake. At this point basically the whole infrastructure was down – at our scale that means about 2 hours of downtime. It was bad! But wait for day number four: the next day we are sitting in the room, again the whole team and the consultant, but also our managers, and we listen to the consultant saying that our testing and deployment procedures are like the Wild Wild West and that we act like cowboys. It was hard to listen to, but he was right and we knew it. The next thing we did was to go to a room with the team and come up with something to solve this issue. We came up with something that may be obvious – a preproduction cluster: a cluster made from the same machine profile and almost identical configuration that we would use for testing. But how to test was the real question. We went into research mode and started reading and watching presentations – we were especially impressed by a tool called HIT from Yahoo, so we contacted the creator. Unfortunately there was no plan to open source it – but they gave us a nice tip: look at Apache BigTop.
Apache BigTop primarily facilitates building and deploying a Hadoop distribution – but you can also use it in a slightly different way: you can point BigTop at your preproduction cluster and use its smoke tests to test the infrastructure. So our current testing and deployment flow is to first deploy to the preproduction cluster and run the BigTop tests to get instant feedback about the change; if the feedback is fine we deploy to production, and if not, there’s something wrong with the change and we know it before it reaches production. One finding from using BigTop is that it’s actually very easy to extend, so we were able to add smoke tests for our own tools like snakebite and Luigi; and, very importantly, we also run some production workloads as part of the smoke tests – which really makes us feel sure about a change.
In the case of Apache BigTop, our problem was testing the Hadoop infrastructure – even though BigTop is not a perfect fit for this, it provides loads of value out of the box, and thus it’s a great example of avoiding the NIH problem.
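As an illustration of the kind of extra smoke check that can sit next to the standard BigTop suite, here is a minimal sketch using snakebite – the NameNode host, port and test path are placeholders, not our real setup:

```python
# Hypothetical snakebite-based smoke check run against a preproduction cluster.
# The NameNode host/port and the test directory are placeholders.
from snakebite.client import Client


def hdfs_smoke_check(namenode_host="preprod-nn.example.com", namenode_port=8020):
    client = Client(namenode_host, namenode_port, use_trash=False)
    test_dir = "/tmp/smoke-test"

    # Basic mkdir/ls/delete round trip; any HDFS problem surfaces as an exception.
    list(client.mkdir([test_dir], create_parent=True))
    list(client.ls([test_dir]))
    list(client.delete([test_dir], recurse=True))
    return True


if __name__ == "__main__":
    assert hdfs_smoke_check()
    print("HDFS smoke check passed")
```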
As an operator there are many ways to help yourself and also delegate some of the work to the developers themselves. One feature that is disabled by default, but great, is log aggregation – how many of you have log aggregation enabled on your cluster? Cool. In a nutshell, this feature aggregates the YARN container logs from the workers and stores them on HDFS for inspection – very useful for troubleshooting. Most of you probably know how to enable it, right?
It’s dead simple. But there’s one question – how long should we keep the logs for? We thought about it for a while, talked with HW a little bit, and since we have a huge cluster, why not store them for a long time – maybe we will need these logs for some analytics, etc.
This is our initial change to the configuration – 10 years. Does anyone know what bad things can happen if you do that?
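For reference, a minimal sketch of the relevant yarn-site.xml properties – enabling aggregation plus the ten-year retention we initially picked; treat the values as illustrative:

```xml
<!-- yarn-site.xml: enable log aggregation and set retention -->
<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
<property>
  <!-- our original choice: 10 years expressed in seconds -->
  <name>yarn.log-aggregation.retain-seconds</name>
  <value>315360000</value>
</property>
```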
If you run enough jobs, after some time you will see something like this on your NN – and when you see something like this on your NN, you end up with a hellephant!
That’s the situation where your Hadoop cluster spectacularly goes down. What happens is that log aggregation creates many, many files, and a very important consequence of many files on HDFS is a growing heap on the NameNode – until you run out of memory on the NN. The lesson is that it’s a good idea to alert on heap usage for your master daemons, but also to understand your configuration and its consequences: keep yourself up to date on configuration changes, and read the code behind Hadoop configuration keys and how they are connected to each other – not all configuration parameters are documented.
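As an example of the kind of alerting meant here, a minimal sketch that polls the NameNode’s JMX servlet and flags high heap usage – the hostname, port and threshold are assumptions, not our production monitoring:

```python
# Minimal sketch of a NameNode heap check via the HDFS JMX servlet.
# Hostname, port and the alert threshold are assumptions for illustration only.
import requests


def namenode_heap_ratio(nn_host="namenode.example.com", nn_http_port=50070):
    url = "http://%s:%d/jmx?qry=java.lang:type=Memory" % (nn_host, nn_http_port)
    heap = requests.get(url).json()["beans"][0]["HeapMemoryUsage"]
    return float(heap["used"]) / float(heap["max"])


if __name__ == "__main__":
    ratio = namenode_heap_ratio()
    if ratio > 0.85:  # arbitrary example threshold
        print("ALERT: NameNode heap at %d%% - too many small files?" % int(ratio * 100))
```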
While on the topic of log aggregation – it’s good to know that aggregation takes the container log directory and aggregates its contents, so if you put extra information about a task in there, you get it aggregated for free and likely bring value to developers – think profiling and garbage collection logs. Speaking of developers – let’s move on to development and how it works at Spotify.
There are a couple of interesting lessons about productive development. The first, and arguably the most important, is to pick the right tool for the job – ask what the most important value to bring right now is. Let’s talk Spotify: in 2009 Spotify started with Hadoop Streaming as the supported framework for MR development. Hadoop Streaming basically enables you to implement MR jobs in languages other than Java – and for many years it was THE framework, because Spotify loved Python and it enabled us to iterate faster and thus provide knowledge to our business. Time was passing and our Hadoop cluster was growing – at some point we needed something different, something better in terms of performance but also maturity. After a long evaluation – and I encourage you to watch the presentations by David Whiting about the different frameworks – we decided to use Apache Crunch as the supported framework for batch MR. Why? A couple of reasons – first, ease of testing and type safety.
This graph shows the number of successful and failed jobs, split by framework, over 6 months – and these are production jobs. As you can see, the two most popular frameworks are Hadoop Streaming and Crunch, but the difference between failed and successful jobs is crucial. Crunch jobs behave much better and are better tested. Type safety helps discover problems at compile time, and the testing framework that comes with Crunch – which we were able to enhance with the Hadoop minicluster – helps users easily test their jobs; it basically makes testing easy, something we missed with our Hadoop Streaming jobs. But performance is another thing –
On this graph we can see map throughput for Apache Crunch and Hadoop Streaming – there’s a huge difference, and again we are talking about production jobs here. Crunch turns out to be on average 8 times faster, and 75% of all Crunch jobs are much faster than all Hadoop Streaming jobs. What is more interesting is that we actually see higher utilization of our cluster the more Crunch jobs run on it – which makes us super happy.
Another thing that Crunch provides is a great abstraction – and that is another thing a productive developer needs to keep in mind: pick the right abstraction for the job. With Crunch we can start thinking in terms of high-level operations like filter, groupBy and join instead of the old map/reduce legacy. This makes implementation more intuitive and simply pleasant, and thus makes the developer experience much better. The interesting thing we have observed is that a higher abstraction may remove some opportunities for optimization, so it’s not as easy to implement the best-performing job – but on the other hand it reduces the problem of premature optimization and on average performs really well. There are very few people at Spotify who actually know how to optimize pure MR jobs or Hadoop Streaming jobs, but the average optimization we get from Crunch turns out to be really good, as you could see on the performance graph.
We do have loads of nodes – and we have scaling machines nailed down; Crunch scales very well. But there’s a big problem that we currently have: scaling people. How do you scale support and best practices? We constantly see problems with code repetition, HDFS mess, lack of data management and YARN resource contention – all of this brings our productivity down. There’s not enough time to go through all of them, but some of these problems we are trying to tackle with nothing other than our beloved automation. Let’s see some examples:
We automate map split size calculation and thus the number of map tasks, but also the number of reducers and therefore the number and size of output files – all of this is done by estimation from historical data using our workflow manager, Luigi, which I encourage you to take a look at!
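A rough sketch of the idea – not our actual code; the historical-size lookups, the 1 GB-per-reducer target and the target mapper count are illustrative assumptions built on Luigi’s Hadoop job support:

```python
# Rough sketch of split-size and reducer-count estimation in a Luigi Hadoop job.
# estimate_input_bytes(), estimate_output_bytes(), the 1 GB-per-reducer target
# and target_map_tasks are illustrative assumptions, not Spotify's heuristics.
import luigi
import luigi.contrib.hadoop
import luigi.contrib.hdfs


def estimate_input_bytes(task):
    """Placeholder: in reality this would come from historical run statistics."""
    return 2 * 1024 ** 4  # pretend ~2 TB of input


def estimate_output_bytes(task):
    """Placeholder: in reality this would come from historical run statistics."""
    return 64 * 1024 ** 3  # pretend ~64 GB of output


class PlayCountAggregation(luigi.contrib.hadoop.JobTask):
    date = luigi.DateParameter()
    bytes_per_reducer = 1024 ** 3  # target ~1 GB per output file
    target_map_tasks = 500         # illustrative mapper count

    @property
    def n_reduce_tasks(self):
        return max(1, estimate_output_bytes(self) // self.bytes_per_reducer)

    def jobconfs(self):
        confs = super(PlayCountAggregation, self).jobconfs()
        split_size = max(1, estimate_input_bytes(self) // self.target_map_tasks)
        # Pin the split size (and hence the mapper count) by setting both bounds.
        confs.append("mapreduce.input.fileinputformat.split.minsize=%d" % split_size)
        confs.append("mapreduce.input.fileinputformat.split.maxsize=%d" % split_size)
        return confs

    def output(self):
        return luigi.contrib.hdfs.HdfsTarget("/data/playcounts/%s" % self.date)

    def mapper(self, line):
        yield line.split("\t")[0], 1

    def reducer(self, key, values):
        yield key, sum(values)
```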
We are about to finish the second iteration of our HDFS retention policy, with which we automatically remove data and thereby reduce HDFS usage – and, in the long term, hopefully reduce the HDFS legacy mess.
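To make the idea concrete, here is a minimal sketch of such a retention sweep using snakebite – the data root, the 90-day cutoff and the flat directory layout are assumptions, not our actual policy:

```python
# Minimal sketch of an HDFS retention sweep using snakebite.
# DATA_ROOT, RETENTION_DAYS and the flat layout are illustrative assumptions.
import time
from snakebite.client import Client

DATA_ROOT = "/data/tmp-derived"   # hypothetical area covered by the policy
RETENTION_DAYS = 90


def expired_paths(client, root, retention_days=RETENTION_DAYS):
    cutoff_ms = (time.time() - retention_days * 86400) * 1000
    for entry in client.ls([root]):
        # snakebite reports modification_time in milliseconds since the epoch
        if entry["modification_time"] < cutoff_ms:
            yield entry["path"]


if __name__ == "__main__":
    client = Client("namenode.example.com", 8020, use_trash=True)
    doomed = list(expired_paths(client, DATA_ROOT))
    if doomed:
        for result in client.delete(doomed, recurse=True):
            print(result)
```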
Another ongoing effort is the second iteration of automatic user feedback. We already expose a database with aggregated information about all MR jobs, which our users can query to learn how their jobs are performing – but we also plan another, very simple iteration focused on Crunch: right after a workflow pipeline is done, it will provide the user with instant feedback on memory usage, garbage collection and so on – very simple tweaks users can apply to improve their jobs. For example, if a user gives a pipeline 8 GB of memory for each task and, after going through the counters, we see that tasks are actually using at most 3 GB, instant feedback to reduce memory can improve the multitenancy of your cluster and thus improve productivity.
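A toy sketch of that memory check – the history server URL, the counter plumbing and the "used less than half" rule are illustrative assumptions about how such feedback could be computed, not our production code:

```python
# Toy sketch of counter-based memory feedback after a pipeline finishes.
# Host name, job id plumbing and the "less than half" rule are assumptions.
import requests

HISTORY_SERVER = "http://historyserver.example.com:19888"


def average_map_heap_mb(job_id):
    """Average committed heap per map task, derived from job-level MR counters."""
    base = "%s/ws/v1/history/mapreduce/jobs/%s" % (HISTORY_SERVER, job_id)
    job = requests.get(base).json()["job"]
    groups = requests.get(base + "/counters").json()["jobCounters"]["counterGroup"]
    for group in groups:
        if group["counterGroupName"].endswith("TaskCounter"):
            for counter in group["counter"]:
                if counter["name"] == "COMMITTED_HEAP_BYTES":
                    # Job-level counters are summed over tasks, so divide by task count.
                    return counter["mapCounterValue"] / max(job["mapsTotal"], 1) / 2 ** 20
    return None


def memory_feedback(job_id, configured_map_mb):
    used_mb = average_map_heap_mb(job_id)
    if used_mb and used_mb < configured_map_mb / 2:
        print("Job %s asked for %d MB per map task but used ~%d MB - consider "
              "lowering mapreduce.map.memory.mb." % (job_id, configured_map_mb, used_mb))
```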
With that, let’s talk about how organizational structure can affect your productivity – but before that, let’s take a look at this graph.
This graph shows Hadoop availability by quarter at Spotify – higher is of course better. OK, so let’s see what happened here:
In the first part the Hadoop cluster was ownerless; it was best-effort support by a team of people who mostly didn’t even want to do Hadoop operations. As a result, multiple days of downtime happened and the infrastructure was in bad shape, denormalized – overall a terrible state to be in. But there was a light at the end of the tunnel – in Q3 we decided to create a squad: 3 people focused solely on the Hadoop infrastructure.
There was an instant improvement right after the squad was created – users were happy and the infrastructure was getting into shape. One of the first decisions we made was to move to YARN in Q4.
In Q4 and the beginning of ’14 we again saw a drop in availability, mostly due to a huge upgrade and its consequences thereafter. The upgrade itself took a whole weekend, and afterwards we saw many issues and fires that we had to put out; during this time we were mostly reactive, but also working on polishing our Puppet manifests. The whole situation stabilized after most of the fires were gone and Puppet was in good shape.
Our goal is to keep Hadoop at three nines of availability – and we have been getting there since Q2 2014. The Hadoop squad receives constant feedback from users, and it’s common to hear that availability was drastically improved, which improved productivity and the overall experience – which is great and makes us want to work even harder to achieve better results. As you see, there’s a small drop at the beginning of Q1 2015 – does anyone know why, or can guess?
With that, let’s now talk about what surrounds us – the culture. I strongly believe the culture at Spotify has a huge influence on productivity – there are three main pillars of this culture.
Experiment, fail fast and embrace failure. We love to experiment and we make time to experiment, whether it’s a company-wide hack week or R&D days – if you wish to experiment, there’s time to do that, and with loads of curious people at Spotify there’s always something going on. The most successful data-based experiments are Luigi – our Hadoop workflow manager – and snakebite – a pure Python HDFS client – and I encourage you to take a look at them. Fail fast: don’t be afraid to admit failure; keep it as part of the learning process, to the point of embracing it. Talk about your failures and share them publicly, for example through presentations both internally and externally – it will make experimentation, and thus innovation, flow much more smoothly. To back this up with an example, let’s talk about the two most recent ongoing experiments.
Another experiment is Spark – it’s pretty much an ongoing experiment that we come back to every now and then, but officially it’s not welcome on the production cluster due to immaturity and poor multitenancy support. That said, the most recent releases are very promising; we are constantly playing with it and have high hopes for it, especially for the recent dynamic resource allocation feature. There’s not much time left, but I would like to share with you two important lessons from our evaluation of a heavy Spark job.
The first hint is about memory settings – there are two important settings that can improve the stability of your heavy Spark jobs: the memory available for caching (spark.storage.memoryFraction) and the memory available for shuffle (spark.shuffle.memoryFraction). The defaults are 0.6 and 0.2, leaving 0.2 for the runtime. In our case we had a heavy machine-learning job that was doing almost a terabyte of shuffle but very little caching; initially we had issues with the shuffle step, but reducing the storage memory and leaving extra memory for shuffle and the runtime improved stability.
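As a concrete sketch (Spark 1.x property names; the values below are illustrative for a shuffle-heavy, cache-light job, not a general recommendation):

```python
# Spark 1.x memory fractions for a shuffle-heavy job that barely caches anything.
# Defaults are 0.6 (storage) and 0.2 (shuffle); the values here are illustrative.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("shuffle-heavy-ml-job")
        .set("spark.storage.memoryFraction", "0.1")    # little caching needed
        .set("spark.shuffle.memoryFraction", "0.5"))   # give the big shuffle more room

sc = SparkContext(conf=conf)
```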
Another issue we hit was long GC pauses – executors would appear to disappear, which in turn triggers recomputation and, in the end, potentially application failure. After tweaking the executor heartbeat interval and the ack.wait timeout we saw an improvement in stability, and even though GC pauses still occurred, they were less harmful.
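Again as a sketch – the Spark 1.x property names that correspond to these settings; the concrete values are only examples of "more generous than the defaults":

```python
# Spark 1.x settings corresponding to the heartbeat interval and ack wait timeout
# mentioned above: a less trigger-happy heartbeat and a longer ack wait.
from pyspark import SparkConf

conf = (SparkConf()
        .set("spark.executor.heartbeatInterval", "60s")
        .set("spark.core.connection.ack.wait.timeout", "600"))
```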
Ok – let’s get this started with operations.
Ok – if you were asleep the whole time, please wake up now for a few minutes.