This talk is about using Hive in practice. We will go through some of the specific use cases for which Hive is currently being used at Last.fm, highlighting its strengths and weaknesses along the way.
This document discusses Python and the pandas library. It provides an overview of Python's history and advantages, such as being easy to learn and having a large standard library. It also discusses the major Python data analysis packages NumPy, SciPy, matplotlib, and pandas. Pandas allows importing data from various sources, manipulating datasets, and performing operations on labeled and indexed data. The document also covers using pandas with other tools like Spark, visualization with matplotlib, and IDEs and notebooks for Python development.
Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It allows for the distributed processing of large datasets across clusters of nodes using simple programming models. Hadoop is highly scalable, running on thousands of nodes, and is designed to reliably handle failures at the hardware or software level.
This document summarizes Neville Li's work at Spotify developing real-time data streaming applications using Storm. It describes Spotify's large data volumes, how Storm is used to process streaming data at Spotify, details of a social listening topology, and lessons learned around development processes, language choices, and deployment.
Scaling Your Team and Technology: The Agile Way - Erik Duindam, Avisi B.V.
The document discusses scaling a team and technology. It covers sorting algorithms such as insertion sort and merge sort, noting that merge sort's O(n log n) running time makes it much faster on large inputs. It observes that a fast server running a slow algorithm can be slower than a slow server running a fast algorithm, and argues for common sense in technology choices over expensive hardware. It also stresses the importance of team culture, the technology environment, and a clear technical vision for building a scalable system.
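The point about algorithmic complexity can be made concrete with a minimal sketch (illustrative Python, not code from the talk): both sorts produce the same result, but their running times diverge sharply as input grows.

```python
import random

def insertion_sort(xs):
    """O(n^2): shift each element left until it is in place."""
    xs = list(xs)
    for i in range(1, len(xs)):
        key, j = xs[i], i - 1
        while j >= 0 and xs[j] > key:
            xs[j + 1] = xs[j]
            j -= 1
        xs[j + 1] = key
    return xs

def merge_sort(xs):
    """O(n log n): split, sort halves recursively, merge."""
    if len(xs) <= 1:
        return list(xs)
    mid = len(xs) // 2
    left, right = merge_sort(xs[:mid]), merge_sort(xs[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]

data = [random.randint(0, 10_000) for _ in range(2_000)]
assert insertion_sort(data) == merge_sort(data) == sorted(data)
```

At n = 2,000 both finish quickly; at n in the millions, the O(n^2) sort becomes unusable while merge sort keeps scaling, which is the talk's "fast algorithm beats fast hardware" point.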
Scio - Moving to Google Cloud, A Spotify Story - Neville Li
Talk at Philly ETE Apr 28 2017
We will talk about Spotify’s story of migrating our big data infrastructure to Google Cloud. Over the past year or so we moved away from maintaining our own 2500+ node Hadoop cluster to managed services in the cloud. We replaced two key components in our data processing stack, Hive and Scalding, with BigQuery and Scio, and are able to iterate at a much faster speed. We will focus on the technical aspects of Scio, a Scala API for Apache Beam and Google Cloud Dataflow, and how it changed the way we process data.
This document provides an overview of Scala data pipelines at Spotify. It discusses:
- The speaker's background and Spotify's scale with over 75 million active users.
- Spotify's music recommendation systems including Discover Weekly and personalized radio.
- How Scala and frameworks like Scalding, Spark, and Crunch are used to build data pipelines for tasks like joins, aggregations, and machine learning algorithms.
- Techniques for optimizing pipelines including distributed caching, bloom filters, and Parquet for efficient storage and querying of large datasets.
- The speaker's success in migrating over 300 jobs from Python to Scala and growing the team of engineers building Scala pipelines at Spotify.
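The Bloom-filter optimization mentioned above can be sketched in a toy, stand-alone form (plain Python rather than Spotify's actual Scala pipelines): before a large join, a compact filter built from the keys of the smaller side discards most non-matching records early, shrinking the shuffle.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k hash positions over an m-bit array (stored as an int)."""
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, item):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def might_contain(self, item):
        return all((self.bits >> p) & 1 for p in self._positions(item))

# Build the filter from the small side of a join ...
small_side_keys = {"user_1", "user_7"}
bf = BloomFilter()
for key in small_side_keys:
    bf.add(key)

# ... and use it to prune the large side before shuffling it.
large_side = [("user_1", "play"), ("user_9", "skip"), ("user_7", "play")]
pruned = [row for row in large_side if bf.might_contain(row[0])]
# False positives are possible (a rare extra row survives); false negatives are not.
```

The filter is small enough to broadcast to every worker, which is what makes the trick pay off in a distributed join.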
SparkR: Enabling Interactive Data Science at Scale on Hadoop - DataWorks Summit
SparkR enables interactive data science at scale on Hadoop by providing an R interface to Apache Spark. Some key points:
- SparkR allows users to manipulate distributed datasets (RDDs) using familiar R operations like map, filter, reduceByKey.
- It integrates R and Spark by running R code on Spark executors via JNI, allowing R scripts to process large datasets in parallel.
- Examples show how to do tasks like word count and logistic regression on Spark using R code, demonstrating the ability to scale R for data science on big data.
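The RDD operations named above map directly onto classic functional primitives. As a Spark-free sketch (plain Python standing in for the SparkR API), a word count is just flatMap + map + reduceByKey:

```python
from collections import defaultdict
from functools import reduce

def reduce_by_key(pairs, fn):
    """Group (key, value) pairs by key, then fold each group with fn."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return {k: reduce(fn, vs) for k, vs in groups.items()}

lines = ["to be or not to be"]
words = [w for line in lines for w in line.split()]   # flatMap
pairs = [(w, 1) for w in words]                       # map
counts = reduce_by_key(pairs, lambda a, b: a + b)     # reduceByKey
# counts == {"to": 2, "be": 2, "or": 1, "not": 1}
```

In SparkR the same shape is expressed in R and executed in parallel across the cluster; the point of the API is that the user-facing operations stay this simple.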
The Elephant in the Cloud: A Quest for the Next Generation
In this talk, I will go through the evolution of Hadoop and its ecosystem projects and try to peer into the crystal ball to predict what may be coming down the pike. I will discuss various ways of crunching data on Hadoop (MapReduce, OpenMPI, Spark and various SQL engines) and how these tools complement each other.
Apache Hadoop is no longer just a faithful, open source, scalable implementation of two seminal papers that came out of Google 10 years ago. It has evolved into a project that provides enterprises with a reliable layer for storing massive amounts of unstructured data (HDFS) while allowing different computational frameworks to leverage those datasets.
The original computational framework (MapReduce) has evolved into a much more scalable set of general-purpose cluster management APIs collectively known as YARN. With YARN underneath, MapReduce is still there to support batch-oriented computations, but it is no longer the only game in town. With OpenMPI, Spark, and Tez rapidly becoming available, now is truly the most exciting time to be a developer in the Hadoop ecosystem. It is also a time when you don't have to be employed by Yahoo!, Facebook, or eBay to have access to mind-blowing compute power. That power is a credit card and a pivotal.io account away for anybody on the planet.
I will conclude by outlining some of the ongoing work that makes Hadoop and its ecosystem projects first class citizens in cloud environments based on the work that Pivotal engineers have done with integrating Hadoop into PivotalONE PaaS.
Kick-R: Get your own R instance with 36 cores on AWS - Kiwamu Okabe
The document describes a solution called Kick-R that runs R scripts on 36 cores on AWS to speed up processing. It demonstrates Kick-R running a random forest script on sample spam data more than 10 times faster on 36 AWS cores than locally on a laptop. Instructions are provided for building, running, and cleaning up the Kick-R environment.
AUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with Dask - Víctor Zabalza
# Talk given at PyCon UK 2017
The first step in any data-intensive project is understanding the available data. To this end, data scientists spend a significant part of their time carrying out data quality assessments and data exploration. In spite of this being a crucial step, it usually requires repeating a series of menial tasks before the data scientist gains an understanding of the dataset and can progress to the next steps in the project.
In this talk I will detail the inner workings of a Python package that we have built which automates this drudge work, enables efficient data exploration, and kickstarts data science projects. A summary is generated for each dataset, including:
- General information about the dataset, including data quality of each of the columns;
- Distribution of each of the columns through statistics and plots (histogram, CDF, KDE), optionally grouped by other categorical variables;
- 2D distribution between pairs of columns;
- Correlation coefficient matrix for all numerical columns.
Building this tool has provided a unique view into the full Python data stack, from the parallelised analysis of a dataframe within a Dask custom execution graph, to the interactive visualisation with Jupyter widgets and Plotly. During the talk, I will also introduce how Dask works, and demonstrate how to migrate data pipelines to take advantage of its scalable capabilities.
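The per-column summaries described above can be sketched in a dependency-free form (a toy illustration, not the package or Dask itself); the real tool computes the same kinds of statistics in parallel over a Dask graph.

```python
from statistics import mean, median

def summarise_column(name, values):
    """Report data quality and basic distribution stats for one column."""
    present = [v for v in values if v is not None]
    summary = {
        "column": name,
        "rows": len(values),
        "missing": len(values) - len(present),   # data-quality check
    }
    if present and all(isinstance(v, (int, float)) for v in present):
        summary.update(min=min(present), max=max(present),
                       mean=mean(present), median=median(present))
    else:
        summary["distinct"] = len(set(present))  # categorical column
    return summary

ages = [34, 29, None, 41, 29]
print(summarise_column("age", ages))
# {'column': 'age', 'rows': 5, 'missing': 1, 'min': 29, 'max': 41, 'mean': 33.25, 'median': 31.5}
```

Scaling this up is where Dask earns its keep: each column's statistics become tasks in an execution graph that run in parallel over partitioned data.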
In this session, we'll review the features and architecture of the new AWS Data Pipeline service and explain how you can use it to better manage your data-driven workloads. We'll then go over a few examples of setting up and provisioning a pipeline in the system.
Lens: Data exploration with Dask and Jupyter widgets - Víctor Zabalza
Lens is an open source Python library for automated data exploration of large datasets using Dask. It computes summary statistics and relationships between columns in a dataset. The results are serialized to JSON for interactive exploration through Jupyter widgets or a web UI. Dask allows the computations to run in parallel across a cluster for scalability. Lens integrates with the SherlockML platform to analyze all datasets uploaded.
Apache Spark: killer or savior of Apache Hadoop? - rhatr
The Big Boss(tm) has just OKed the first Hadoop cluster in the company. You are the guy in charge of analyzing petabytes of your company's valuable data using a combination of custom MapReduce jobs and SQL-on-Hadoop solutions. All of a sudden the web is full of articles telling you that Hadoop is dead, Spark has won and you should quit while you're still ahead. But should you?
"Who’s Afraid of Graphs?" by David Ostrovsky
Graphs are everywhere. Friended someone on Facebook? Graphs. Checked the best route to avoid traffic on Google Maps? Graphs. Those recruiters that keep spamming you with job offers on LinkedIn? They find you through graphs. We’re surrounded by problems that can be best represented and solved through graphs, and yet graph databases and processing frameworks remain an obscure niche accessible mainly to data scientists and academics. It’s time to right the injustice and bring graphs to the masses! This session is an introduction to Neo4j, OrientDB, GraphX, Giraph, and others.
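To see why these problems fit graphs so naturally, here is a minimal sketch (illustrative Python, not from the talk and not any of the listed databases) of the "best route" case: breadth-first search over an adjacency dict finds a shortest path.

```python
from collections import deque

def shortest_path(graph, start, goal):
    """BFS over an adjacency dict; returns one shortest list of nodes, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbour in graph.get(path[-1], []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None

# Hypothetical road network: the route through "jam" is longer.
roads = {
    "home": ["a", "b"],
    "a": ["jam"],
    "b": ["office"],
    "jam": ["office"],
}
print(shortest_path(roads, "home", "office"))  # ['home', 'b', 'office']
```

Graph databases and frameworks like the ones in this session generalize exactly this: the data is stored as nodes and edges, so traversals like BFS are first-class queries instead of expensive joins.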
1. Hadoop is a framework for distributed processing of large datasets across clusters of computers.
2. Hadoop can be used to perform tasks like large-scale sorting and data analysis faster than with traditional databases like MySQL.
3. Example applications of Hadoop include processing web server logs, managing user profiles for a large website, and performing machine learning on massive datasets.
Sorry - How Bieber broke Google Cloud at Spotify - Neville Li
Talk at Scala Up North Jul 21 2017
We will talk about Spotify's story with Scala big data and our journey to migrate our entire data infrastructure to Google Cloud and how Justin Bieber contributed to breaking it. We'll talk about Scio, a Scala API for Apache Beam and Google Cloud Dataflow, and the technology behind it, including macros, algebird, chill and shapeless. There'll also be a live coding demo.
This document provides an overview of Spark, including its history, use cases, architecture, and ecosystem. Some key points:
- Spark is an open-source cluster computing framework that allows processing of large datasets in parallel across compute clusters. It was developed at UC Berkeley in 2009 and became a top-level Apache project in 2013.
- Spark can be used for tasks like log analysis, text processing, analytics, search, and fraud detection on large datasets distributed across clusters. It offers APIs in Scala, Java, Python and can integrate with Hadoop ecosystem.
- Spark uses Resilient Distributed Datasets (RDDs) as its basic abstraction, allowing data to be processed in parallel. Transformations on RDDs are evaluated lazily, while actions trigger the actual computation.
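The transformation/action split can be mimicked with Python generators (a plain-Python analogy, not the PySpark API): transformations chain lazily and nothing runs until an action consumes the pipeline.

```python
log_lines = [
    "INFO startup complete",
    "ERROR disk full",
    "ERROR network timeout",
]

# "Transformations": lazily chained generators; nothing is computed yet.
errors = (line for line in log_lines if line.startswith("ERROR"))   # like filter()
messages = (line.split(" ", 1)[1] for line in errors)               # like map()

# "Action": forces the whole pipeline to run in one pass.
result = list(messages)
print(result)  # ['disk full', 'network timeout']
```

Spark applies the same idea at cluster scale: because the chain is lazy, the engine can see the whole pipeline before execution and plan how to distribute it.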
Slides from a talk at a meetup organized by SF Scala at Spotify's San Francisco office. The slides present details of playlist recommendations at Spotify and how Spotify uses Scalding to develop robust and reliable pipelines to generate these recommendations.
Meetup details: http://www.meetup.com/SF-Scala/events/224430674/
This document provides an overview of Hadoop architecture. It discusses how Hadoop uses MapReduce and HDFS to process and store large datasets reliably across commodity hardware. MapReduce allows distributed processing of data through mapping and reducing functions. HDFS provides a distributed file system that stores data reliably in blocks across nodes. The document outlines components like the NameNode, DataNodes and how Hadoop handles failures transparently at scale.
Apache Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It provides reliable storage through its distributed file system and scalable processing through its MapReduce programming model. Yahoo! uses Hadoop extensively for applications like log analysis, content optimization, and computational advertising, processing over 6 petabytes of data across 40,000 machines daily.
This document summarizes the agenda for an RDA bootcamp presented at the Vermont Library Conference on May 21, 2013. The presentation introduces Resource Description and Access (RDA), the new cataloging standard that replaces AACR2. The agenda includes an introduction to RDA and why it was developed, the basics of RDA description and access points, and how to implement RDA in libraries' catalogs. Presenters will discuss how RDA provides a better way to describe materials and gives catalogers more flexibility. While RDA makes some changes, the core of cataloging will remain similar, and hybrid catalogs combining RDA and AACR2 records are acceptable. Attendees are encouraged not to panic about the transition to RDA.
Frontera: a distributed robot for large-scale web crawling / Александр С... - Ontico
In this talk I will share our experience crawling the Spanish internet. We set ourselves the task of crawling about 600 thousand websites in the .es zone in order to collect statistics about hosts and their sizes. I will describe the crawler's architecture, the storage, the problems we ran into during the crawl, and how we solved them.
Our solution is available as the open source framework Frontera. The framework lets you build a distributed robot for downloading pages from the Internet at high volume in real time. It can also be used to build focused crawlers that fetch a subset of websites known in advance.
The framework offers: configurable storage for document URLs (RDBMS or key-value), crawl strategy management, a transport-layer abstraction, and a download-module abstraction.
The talk is structured in an engaging way: a description of the problem, the solution, and the issues that arose while building that solution.
- This presentation provides an overview of Cloudera Search, which brings Solr-based search capabilities to Hadoop.
- Key projects involved include Lucene, Solr, and Hadoop which can be integrated to allow indexing of data on HDFS and querying via search.
- The presentation discusses architectural details of running Solr on HDFS and integrating other Hadoop projects like HBase, MapReduce, and Hue.
This document contains the agenda for the Kansas City DevOps Meetup on December 5, 2012. The agenda includes presentations on Google Fiberspace and DevOps logistics by Aaron from Cerner and Stathy from OpsCode. It also discusses deciding on topics and volunteers for future meetups, with suggestions like infrastructure as code, continuous deployment, and experience sharing.
Bobby Evans and Tom Graves, the engineering leads for Spark and Storm development at Yahoo, will talk about how these technologies are used on Yahoo's grids and reasons to use one or the other.
Bobby Evans is the low latency data processing architect at Yahoo. He is a PMC member on many Apache projects including Storm, Hadoop, Spark, and Tez. His team is responsible for delivering Storm as a service to all of Yahoo and maintaining Spark on Yarn for Yahoo (Although Tom really does most of that work).
Tom Graves is a Senior Software Engineer on the Platform team at Yahoo. He is an Apache PMC member on Hadoop, Spark, and Tez. His team is responsible for delivering and maintaining Spark on Yarn for Yahoo.
The document discusses using Python with Hadoop frameworks. It introduces Hadoop Distributed File System (HDFS) and MapReduce, and how to use the mrjob library to write MapReduce jobs in Python. It also covers using Python with higher-level Hadoop frameworks like Pig, accessing HDFS with snakebite, and using Python clients for HBase and the PySpark API for the Spark framework. Key advantages discussed are Python's rich ecosystem and ability to access Hadoop frameworks.
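The MapReduce model those libraries wrap can be simulated in plain Python (a stand-alone sketch of the streaming-style flow; with mrjob the same mapper and reducer would be methods on an MRJob subclass):

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    """Emit (word, 1) for every word on the line."""
    for word in line.split():
        yield word.lower(), 1

def reducer(word, counts):
    """Sum the counts collected for one word."""
    yield word, sum(counts)

lines = ["Hadoop streams data", "Python streams Hadoop"]

# Map phase, followed by the shuffle/sort Hadoop performs between phases.
mapped = sorted(pair for line in lines for pair in mapper(line))

# Reduce phase: one reducer call per distinct key.
counts = dict(
    result
    for word, group in groupby(mapped, key=itemgetter(0))
    for result in reducer(word, (n for _, n in group))
)
print(counts)
```

Hadoop Streaming runs exactly this shape at scale, feeding lines to the mapper on stdin and sorted key/value pairs to the reducer, which is why Python slots into the framework so easily.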
In KDD2011, Vijay Narayanan (Yahoo!) and Milind Bhandarkar (Greenplum Labs, EMC) conducted a tutorial on "Modeling with Hadoop". This is the first half of the tutorial.
This document discusses Hive on Spark, which allows Apache Hive queries to run on Apache Spark. It provides background on Hive, Spark, and their limitations. Hive on Spark was developed by the Hive community to leverage Spark's more efficient execution while maintaining compatibility. Examples are given of how simple and join queries are translated from Hive operations to Spark transformations and actions. Improvements to Spark needed to better support Hive are also outlined. The author thanks contributors from various organizations working on Hive on Spark.
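The query-translation idea can be made concrete with a toy sketch (plain Python illustrating the shape, not Hive on Spark's actual planner output): a SQL join roughly becomes keying both sides, grouping by key, and emitting the cross product per key.

```python
from collections import defaultdict

# SELECT u.name, o.total FROM users u JOIN orders o ON u.id = o.user_id
users  = [(1, "ada"), (2, "bob")]      # (id, name)
orders = [(1, 30), (1, 12), (2, 7)]    # (user_id, total)

# Shuffle stage: bucket both relations by the join key.
buckets = defaultdict(lambda: ([], []))
for uid, name in users:
    buckets[uid][0].append(name)
for uid, total in orders:
    buckets[uid][1].append(total)

# Reduce stage: per key, emit every (name, total) pairing.
joined = [(name, total)
          for names, totals in buckets.values()
          for name in names
          for total in totals]
print(sorted(joined))  # [('ada', 12), ('ada', 30), ('bob', 7)]
```

In Hive on Spark the same decomposition happens under the hood: the Hive operator tree for the join is compiled into Spark shuffle and aggregation stages rather than MapReduce jobs.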
The document discusses the need for a W3C community group on RDF stream processing. It notes there is currently heterogeneity in RDF stream models, query languages, implementations, and operational semantics. The speaker proposes creating a W3C community group to better understand these differences, requirements, and potentially develop recommendations. The group's mission would be to define common models for producing, transmitting, and continuously querying RDF streams. The presentation provides examples of use cases and outlines a template for describing them to collect more cases to understand requirements.
The Evolution of Hadoop at Spotify - Through Failures and PainRafał Wojdyła
The quickest way to learn and evolve infrastructure is by encountering obstacles and being forced to overcome limitations that keep you inches away from project goals. At Spotify, we’ve encountered many of these obstacles and frustrations as we grew our Hadoop cluster from a few machines in an office closet aggregating played song events for financial reports, to our current 900 node cluster that plays a large role in many features that you see in our application today.
Two members of Spotify’s Hadoop ‘squad’ will weave in war stories, failures, frustrations and lessons learned to describe the Hadoop/Big Data architecture at Spotify and talk about how that architecture has evolved.
We’ll talk about how and why we use a number of tools, including Apache Falcon and Apache Bigtop to test changes; Apache Crunch, Scalding and Hive w/ Tez to build features and provide analytics; and Snakebite and Luigi, two in-house tools created to overcome common frustrations.
Cassandra Day SV 2014: Spark, Shark, and Apache CassandraDataStax Academy
Evan Chan from Ooyala presents on integrating Apache Spark and Apache Cassandra for interactive analytics. He discusses how Ooyala uses Cassandra for analytics and is becoming a major Spark user. The talk focuses on using Spark to generate dynamic queries over Cassandra data, as precomputing all possible aggregates is infeasible at Ooyala's scale. Chan describes Ooyala's architecture that uses Spark to generate materialized views from Cassandra for fast querying, and demonstrates running queries over a Spark/Cassandra dataset.
Music Personalization : Real time Platforms.Esh Vckay
1. The document discusses music personalization techniques at Spotify, including understanding users and music content, using collaborative filtering and latent vector models to make recommendations, and building real-time recommendation systems using Apache Storm.
2. It describes how Spotify uses machine learning techniques like matrix factorization and word2vec to generate latent vectors for users, songs, artists and playlists to measure similarity and make personalized recommendations at scale for its 75 million users.
3. The key challenges are processing huge amounts of data from 1 billion playlists and 1TB of logs daily to provide recommendations for each new user within 3 seconds and in real-time as listening behaviors change.
OCF.tw's talk about "Introduction to spark"Giivee The
在 OCF and OSSF 的邀請下分享一下 Spark
If you have any interest about 財團法人開放文化基金會(OCF) or 自由軟體鑄造場(OSSF)
Please check http://ocf.tw/ or http://www.openfoundry.org/
另外感謝 CLBC 的場地
如果你想到在一個良好的工作環境下工作
歡迎跟 CLBC 接洽 http://clbc.tw/
Analyze one year of radio station songs aired with Spark SQL, Spotify, and Da...Paul Leclercq
Paris Spark Meetup - May 2017
Video : https://www.youtube.com/watch?v=w5Zd-1wIJrU
AdHoc analysis of radio stations broadcasts stored in a parquet files with plain SQL, the dataframe API.
The aim was to notice radio stations habits, differences and if radio stations brainwashing is a thing
This talk's Databricks notebook can be found here : https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6937750999095841/3645330882010081/6197123402747553/latest.html
The document discusses a presentation about practical problem solving with Hadoop and Pig. It provides an agenda that covers introductions to Hadoop and Pig, including the Hadoop distributed file system, MapReduce, performance tuning, and examples. It discusses how Hadoop is used at Yahoo, including statistics on usage. It also provides examples of how Hadoop has been used for applications like log processing, search indexing, and machine learning.
This document discusses processing large graphs. It introduces graph processing with MapReduce and Apache Giraph. MapReduce algorithms for finding triangles and connected components in graphs are described. The limitations of MapReduce for graph processing are discussed. Alternative graph processing technologies including Neo4j, a graph database, are presented. A movie recommendation use case is demonstrated using Neo4j to find similar users and recommend unseen movies.
5 things cucumber is bad at by Richard LawrenceSkills Matter
This talk will look at 5 things Cucumber’s bad at, why that’s a good thing, and what it tells us about Cucumber’s sweet spot in a team’s toolkit.
Many times, when people complain about something Cucumber’s not good at, they’re unwittingly describing something Cucumber shouldn't be good at. They’re revealing that they don’t quite understand BDD and Cucumber’s role in it.
Cucumber is the world's most misunderstood collaboration tool and people need to hear this over and over again.
Patterns for slick database applicationsSkills Matter
Slick is Typesafe's open source database access library for Scala. It features a collection-style API, compact syntax, type-safe, compositional queries and explicit execution control. Community feedback helped us to identify common problems developers are facing when writing Slick applications. This talk suggests particular solutions to these problems. We will be looking at reducing boiler-plate, re-using code between queries, efficiently modeling object references and more.
Scala e xchange 2013 haoyi li on metascala a tiny diy jvmSkills Matter
Metascala is a tiny metacircular Java Virtual Machine (JVM) written in the Scala programming language. Metascala is barely 3000 lines of Scala, and is complete enough that it is able to interpret itself metacircularly. Being written in Scala and compiled to Java bytecode, the Metascala JVM requires a host JVM in order to run.
The goal of Metascala is to create a platform to experiment with the JVM: a 3000 line JVM written in Scala is probably much more approachable than the 1,000,000 lines of C/C++ which make up HotSpot, the standard implementation, and more amenable to implementing fun features like continuations, isolates or value classes. The 3000 lines of code gives you:
The bytecode interpreter, together with all the run-time data structures
A stack-machine to SSA register-machine bytecode translator
A custom heap, complete with a stop-the-world, copying garbage collector
Implementations of parts of the JVM's native interface
Although it is far from a complete implementation, Metascala already provides the ability to run untrusted bytecode securely (albeit slowly), since every operation which could potentially cause harm (including memory allocations and CPU usage) is virtualized and can be controlled. Ongoing work includes tightening of the security guarantees, improving compatibility and increasing performance.
ENJOYIN
Oscar reiken jr on our success at manheimSkills Matter
This document discusses test automation at Manheim, a wholesale auto auction company. It describes how test automation was implemented for three of Manheim's major applications: Ove.com, Simulcast, and Manheim.com. Regression testing times were reduced from over 160 hours to under 10 minutes for Ove.com and similar improvements for the other applications. This was achieved by converting test cases to Cucumber scenarios, prioritizing by business value, and implementing the tests in Ruby and Java using tools like Watir and Selenium. The automation allows running hundreds of tests in parallel and integration with a build pipeline.
Progressive f# tutorials nyc dmitry mozorov & jack pappas on code quotations ...Skills Matter
Code Quotations: Code-as-Data for F#
This tutorial will cover F# Code Quotations in-depth. You'll learn what Code Quotations are, how to use them, and where to apply them in your applications. We'll work through several real-world examples to highlight the important features -- and potential pitfalls -- of Code Quotations.
Cukeup nyc ian dees on elixir, erlang, and cucumberlSkills Matter
Elixir, Erlang, and Cucumberl
Elixir is a new Ruby-inspired programming language that uses the powerful concurrent machinery of Erlang behind the scenes. Cucumberl is a port of Cucumber to Erlang. Let's see what happens when we put them together.
In this talk, we'll discuss:
How Erlang's concurrency makes it easier to write robust programs
Elixir's approachable syntax
How to test Erlang and Elixir programs using Cucumberl
Attendees will walk away with a solid introduction to the principles of Erlang, and an appreciation of the way Elixir brings the joy of Ruby to the solidity of the Erlang runtime.
Cukeup nyc peter bell on getting started with cucumber.jsSkills Matter
Cukeup NYC. Peter Bell on Getting started with cucumber.js
Ever wished you could use cucumber in your javascript apps? In this talk we'll look at the current state of play of cucumber js, when you should and shouldn't use it, and how to get started writing your step definitions in javascript.
Agile testing & bdd e xchange nyc 2013 jeffrey davidson & lav pathak & sam ho...Skills Matter
In this engaging experience report, we will present 3 different views – Developer, Tester, Business Analyst – of implementing Acceptance Test Driven Development in a complex, data-driven domain. Hear how we used ATDD for building a ubiquitous language across the entire team, promoting faster feedback, and cultivating a culture where product owners were deeply invested in the quality of both every deliverable and the system as a whole.
Progressive f# tutorials nyc rachel reese & phil trelford on try f# from zero...Skills Matter
This document outlines the agenda for a workshop on trying F# for data science. The agenda includes introductions and setup, three sets of hands-on exercises - getting started in F#, financial applications, and data visualization - as well as breaks. It also covers type providers and their benefits, and concludes with a challenge problem and additional resources.
Progressive f# tutorials nyc don syme on keynote f# in the open source worldSkills Matter
F# is a powerful open-source language which Microsoft, other companies and the F# community all contribute to. In this talk, Don will discuss how the “F# space” has recently opened up significantly in interesting ways. F# now includes contributions that range from Cloud IDE platforms, Cloud Compute frameworks, Data interoperability components, Cross-platform execution, Try F#, MonoDevelop, and even Emacs editor integration with surprising tooling support, as well as the Visual F# tools from Microsoft and the broader NuGet package ecosystem. Don will also talk about some of the latest contributions from Microsoft Research, including new type provider components for F#, and describe how his team work with the Visual F# team and other teams around Microsoft. There will also be demos of some fun new stuff that’s been going on with F# at MSR and the community.
Agile testing & bdd e xchange nyc 2013 gojko adzic on bond villain guide to s...Skills Matter
Would you like to learn how to make your software testing practices more effective? And how to use your testing strategy to better capture and reflect customer requirements? Gojko Adzic takes a critical look at the effectiveness of current software testing practices and proposes strategies to make it much more effective.
Dmitry mozorov on code quotations code as-data for f#Skills Matter
This document summarizes code quotations in F# and their uses, including for meta-programming, code transformation, testing frameworks, type providers, and data binding. Key points include: code quotations allow treating code as data; F# supports full language quotations unlike C# expression trees; quotations enable composing and decomposing code; and quotations are essential for type providers to access and represent types from other sources. Examples are provided for constructing and splicing quotations, implementing type providers, and using quotations for GUI input validation and data binding.
The document appears to be notes from an acceptance testing session where examples are provided and expanded upon to test different scenarios. It starts with a story about Mary and her lamb and provides examples of adding different variables like location or characters. It then moves to testing a library book reservation system, providing different user types and expected outcomes. The document ends with two short poems used to discuss how word choice matters.
The document discusses the architecture and workflow of deploying applications on Cloud Foundry. It describes how the vmc command line tool is used to target an API endpoint, login, push an application, and shows the steps Cloud Foundry takes to validate the application package, stage it, find an available Diego Application Container instance to run it, and start the application.
The document discusses the concept of serendipity and how to increase serendipitous discoveries through database and system design. It defines serendipity as the occurrence of beneficial discoveries by chance and describes three steps to encourage serendipity: 1) remove isolation by increasing connections across semantic and contextual boundaries, 2) allow information to traverse multiple hops, and 3) weight and filter information based on relevance and user feedback. Graph databases are said to better support serendipity compared to relational databases by more easily facilitating these three steps.
Simon Peyton Jones: Managing parallelismSkills Matter
If you want to program a parallel computer, it obviously makes sense to start with a computational paradigm in which parallelism is the default (ie functional programming), rather than one in which computation is based on sequential flow of control (the imperative paradigm). And yet, and yet ... functional programmers have been singing this tune since the 1980s, but do not yet rule the world. In this talk I’ll say why I think parallelism is too complex a beast to be slain at one blow, and how we are going to be driven, willy-nilly, towards a world in which side effects are much more tightly controlled than now. I’ll sketch a whole range of ways of writing parallel program in a functional paradigm (implicit parallelism, transactional memory, data parallelism, DSLs for GPUs, distributed processes, etc, etc), illustrating with examples from the rapidly moving Haskell community, and identifying some of the challenges we need to tackle.
The document discusses big data and Hadoop. It notes that big data comes in terabytes and petabytes, sometimes generated daily. Hadoop is presented as a framework for distributed computing on large datasets using MapReduce. While Hadoop can store and process massive amounts of data across commodity servers, it was not designed for business intelligence requirements. The document proposes addressing this by adding data integration and transformation capabilities to Hadoop through tools like Pentaho Data Integration, to enable it to better meet the needs of big data analytics.
This document discusses different types of "magic" that can be done in Pentaho reporting including parameter magic, wizard magic, and query magic. Parameter magic allows parameters to control system settings like enabling server-side printing. Wizard magic involves features for customizing report outputs like summary fields and row layout. Query magic refers to functions that allow running queries and using results in reports or parameters, primarily for calculated default values. Demos are provided for each type of magic.
I went to_a_communications_workshop_and_they_tSkills Matter
The document discusses lessons from a communications workshop. It covers:
1. The benefits of continuous integration (CI) automation over manual processes, including peace of mind and high visibility.
2. An introduction to the Community Build Framework (CBF) which manages server configurations and automatically patches builds.
3. Types of tests that can be automated, including databases, ETL, user interfaces, and more.
This document summarizes Saiku, an open source business intelligence and analytics tool. Saiku provides a lightweight user interface using HTML and JavaScript with a separate Java server and RESTful JSON communication. It is 100% open source and easily integrates with other data sources like SAP BW, Microsoft Analysis Services, and Mondrian. The roadmap outlines upcoming releases in early 2011 that will add features like drill support, visualizations, and basic integration with SAP BW.
2. What is Last.fm?
A music community website, powered by scrobbling, that provides personalised radio.
We aggregate scrobbles. A single scrobble is the smallest unit of music attention data.
1 scrobble = (track, artist, timestamp).
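In Hive, that tuple maps naturally onto a table row. A minimal sketch of what such a table could look like (hypothetical schema; the column names match the queries later in this deck, but this is not Last.fm's actual table definition):

```sql
-- Illustrative only, not Last.fm's actual schema.
CREATE TABLE scrobbles (
  userid   BIGINT,
  trackid  BIGINT,
  artistid BIGINT,
  unixtime BIGINT  -- when the listen happened, as a Unix timestamp
)
PARTITIONED BY (insertdate STRING);  -- e.g. '2009-12-01'; lets Hive prune by date
```

Partitioning by date means queries that filter on `insertdate` only read the matching partitions instead of scanning every scrobble ever recorded.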
3. In numbers
• 40 million users visit the site every month
• 39 billion scrobbles (600 per second)
• 400k personalised radio stations per day
enter Hadoop...
4. Hadoop cluster
• 44 nodes
• 8 cores per node
• 16 GB RAM per node
• 4x 1TB 7200rpm disks per node
5. Hadoop: what is it good for?
• Charts
• Reporting
• Corrections
• Site stats / metrics
• Neighbours
• Recommendations
6. But wait, can you tell us about <stuff/>?
• How many?
• When?
• Where?
• Who?
• Why? Why not?
7. Ad hoc questions
• We get them all the time.
• Questions are good things, but answers take up time.
• We would typically write programs once, run once.
enter Hive...
8. What is Hive?
"Hive is a data warehouse infrastructure built on top of Hadoop."
You get an SQL-like language for queries.
Start queries from a shell, file, JDBC, or Thrift.
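For example, the same one-liner can be run interactively or non-interactively (an illustrative sketch; `count_scrobbles.hql` is a hypothetical script name):

```sql
-- Interactively from the Hive shell, or from the command line:
--   hive -e "SELECT count(1) FROM scrobbles;"
-- or from a script file:
--   hive -f count_scrobbles.hql
SELECT count(1) FROM scrobbles;
```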
13. Example:
SELECT artistid, insertdate, count(1)
FROM scrobbles
WHERE (trackid = 10019 OR trackid = 368575614)
AND insertdate >= '2009-12-01'
AND insertdate <= '2009-12-31'
GROUP BY artistid, insertdate
ORDER BY artistid, insertdate;
14. Example:
(Venn diagram: users that scrobble vs. users that use the radio; how much do they overlap?)
15. Example:
SELECT count(1) FROM scrobbles GROUP BY userid;
SELECT count(1) FROM radiologs GROUP BY userid;
SELECT count(1)
FROM radiologs r JOIN scrobbles s
  ON r.userid = s.userid
GROUP BY r.userid;
16. Example:
Consider a user's scrobbles and radio listens for just one track.
(Timeline diagram: the user's scrobbles and radio listens plotted over time, with the first scrobble marked.)
17. Example:
SELECT r.userid, r.trackid, count(1)
FROM
(
  SELECT userid, trackid, min(unixtime) AS unixtime
  FROM scrobbles GROUP BY userid, trackid
) s
JOIN radiologs r
ON r.userid = s.userid AND r.trackid = s.trackid
WHERE s.unixtime < r.unixtime
GROUP BY r.userid, r.trackid;
18. Other nice things about hive
• Joins are really really easy (most of the time).
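As an illustration, pulling human-readable artist names into the earlier chart query is one extra clause, assuming a hypothetical artists lookup table with artistid and name columns:

```sql
SELECT a.name, s.insertdate, count(1)
FROM scrobbles s
JOIN artists a ON s.artistid = a.artistid
WHERE s.insertdate >= '2009-12-01'
  AND s.insertdate <= '2009-12-31'
GROUP BY a.name, s.insertdate;
```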
19. Preparing a search index
(Diagram: the crowd, labels, the scrobble cloud, charts, corrections, and the catalogue all feed into Hive, which builds the artist, album, and track indexes served to Solr.)
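One way a Hive-to-Solr handoff can be sketched (illustrative only; the path and the artists/charts tables are hypothetical, not Last.fm's actual pipeline) is to write the joined result out to an HDFS directory that a Solr indexing job then consumes:

```sql
-- Materialise the data to be indexed; a downstream job feeds it to Solr.
INSERT OVERWRITE DIRECTORY '/exports/solr/artists'
SELECT a.artistid, a.name, c.playcount
FROM artists a
JOIN charts c ON a.artistid = c.artistid;
```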
20. Not so great
• No RecordIO support.
• Really huge joins can cause out-of-memory exceptions.
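When one side of the join fits in memory, a common workaround in Hive of this era is a map-side join hint, which holds the small table in memory on each mapper and avoids the reduce-side shuffle where large joins tend to blow up (illustrative; the artists table is hypothetical):

```sql
-- The MAPJOIN hint loads the small table (a) into memory on each mapper,
-- streaming the large scrobbles table past it instead of shuffling both.
SELECT /*+ MAPJOIN(a) */ a.name, count(1)
FROM scrobbles s
JOIN artists a ON s.artistid = a.artistid
GROUP BY a.name;
```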