This talk is about using Hive in practice. We will go through some of the specific use cases for which Hive is currently being used at Last.fm, highlighting its strengths and weaknesses along the way.
This document discusses Python and the pandas library. It provides an overview of Python's history and advantages, such as being easy to learn and having a large standard library. It also discusses the major Python data analysis packages NumPy, SciPy, matplotlib, and pandas. Pandas allows importing data from various sources, manipulating datasets, and performing operations on labeled and indexed data. The document also covers using pandas with other tools like Spark, visualization with matplotlib, and IDEs and notebooks for Python development.
Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It allows for the distributed processing of large datasets across clusters of nodes using simple programming models. Hadoop is highly scalable, running on thousands of nodes, and is designed to reliably handle failures at the hardware or software level.
This document summarizes Neville Li's work at Spotify developing real-time data streaming applications using Storm. It describes Spotify's large data volumes, how Storm is used to process streaming data at Spotify, details of a social listening topology, and lessons learned around development processes, language choices, and deployment.
Scaling Your Team and Technology: The Agile Way - Erik Duindam, Avisi B.V.
The document discusses scaling a team and technology. It covers sorting algorithms such as insertion sort and merge sort, noting that merge sort's O(n log n) running time makes it much faster on large inputs. It observes that a fast server running a slow algorithm can be slower than a slow server running a fast algorithm, and argues for common sense in technology choices over expensive hardware. It also stresses the importance of team culture, the technology environment, and a clear technical vision for building a scalable system.
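The point about algorithmic complexity can be made concrete with a minimal sketch (illustrative Python, not code from the talk): both sorts produce the same result, but their running times diverge sharply as input grows.

```python
import random

def insertion_sort(xs):
    """O(n^2): shift each element left until it is in place."""
    xs = list(xs)
    for i in range(1, len(xs)):
        key, j = xs[i], i - 1
        while j >= 0 and xs[j] > key:
            xs[j + 1] = xs[j]
            j -= 1
        xs[j + 1] = key
    return xs

def merge_sort(xs):
    """O(n log n): split, sort halves recursively, merge."""
    if len(xs) <= 1:
        return list(xs)
    mid = len(xs) // 2
    left, right = merge_sort(xs[:mid]), merge_sort(xs[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]

data = [random.randint(0, 10_000) for _ in range(2_000)]
assert insertion_sort(data) == merge_sort(data) == sorted(data)
```

At n = 2,000 both finish quickly; at n in the millions, the O(n^2) sort becomes unusable while merge sort keeps scaling, which is the talk's "fast algorithm beats fast hardware" point.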
Scio - Moving to Google Cloud, A Spotify Story - Neville Li
Talk at Philly ETE Apr 28 2017
We will talk about Spotify’s story of migrating our big data infrastructure to Google Cloud. Over the past year or so we moved away from maintaining our own 2500+ node Hadoop cluster to managed services in the cloud. We replaced two key components in our data processing stack, Hive and Scalding, with BigQuery and Scio, and are able to iterate at a much faster speed. We will focus on the technical aspects of Scio, a Scala API for Apache Beam and Google Cloud Dataflow, and how it changed the way we process data.
This document provides an overview of Scala data pipelines at Spotify. It discusses:
- The speaker's background and Spotify's scale with over 75 million active users.
- Spotify's music recommendation systems including Discover Weekly and personalized radio.
- How Scala and frameworks like Scalding, Spark, and Crunch are used to build data pipelines for tasks like joins, aggregations, and machine learning algorithms.
- Techniques for optimizing pipelines including distributed caching, bloom filters, and Parquet for efficient storage and querying of large datasets.
- The speaker's success in migrating over 300 jobs from Python to Scala and growing the team of engineers building Scala pipelines at Spotify.
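The Bloom-filter optimization mentioned above can be sketched in a toy, stand-alone form (plain Python rather than Spotify's actual Scala pipelines): before a large join, a compact filter built from the keys of the smaller side discards most non-matching records early, shrinking the shuffle.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k hash positions over an m-bit array (stored as an int)."""
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, item):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def might_contain(self, item):
        return all((self.bits >> p) & 1 for p in self._positions(item))

# Build the filter from the small side of a join ...
small_side_keys = {"user_1", "user_7"}
bf = BloomFilter()
for key in small_side_keys:
    bf.add(key)

# ... and use it to prune the large side before shuffling it.
large_side = [("user_1", "play"), ("user_9", "skip"), ("user_7", "play")]
pruned = [row for row in large_side if bf.might_contain(row[0])]
# False positives are possible (a rare extra row survives); false negatives are not.
```

The filter is small enough to broadcast to every worker, which is what makes the trick pay off in a distributed join.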
SparkR: Enabling Interactive Data Science at Scale on Hadoop - DataWorks Summit
SparkR enables interactive data science at scale on Hadoop by providing an R interface to Apache Spark. Some key points:
- SparkR allows users to manipulate distributed datasets (RDDs) using familiar R operations like map, filter, reduceByKey.
- It integrates R and Spark by running R code on Spark executors via JNI, allowing R scripts to process large datasets in parallel.
- Examples show how to do tasks like word count and logistic regression on Spark using R code, demonstrating the ability to scale R for data science on big data.
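The RDD operations named above map directly onto classic functional primitives. As a Spark-free sketch (plain Python standing in for the SparkR API), a word count is just flatMap + map + reduceByKey:

```python
from collections import defaultdict
from functools import reduce

def reduce_by_key(pairs, fn):
    """Group (key, value) pairs by key, then fold each group with fn."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return {k: reduce(fn, vs) for k, vs in groups.items()}

lines = ["to be or not to be"]
words = [w for line in lines for w in line.split()]   # flatMap
pairs = [(w, 1) for w in words]                       # map
counts = reduce_by_key(pairs, lambda a, b: a + b)     # reduceByKey
# counts == {"to": 2, "be": 2, "or": 1, "not": 1}
```

In SparkR the same shape is expressed in R and executed in parallel across the cluster; the point of the API is that the user-facing operations stay this simple.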
The Elephant in the Cloud: A Quest for the Next Generation
In this talk, I will go through the evolution of Hadoop and its ecosystem projects and try to peer into the crystal ball to predict what may be coming down the pike. I will discuss various ways of crunching data on Hadoop (MapReduce, OpenMPI, Spark and various SQL engines) and how these tools complement each other.
Apache Hadoop is no longer just a faithful, open source, scalable implementation of two seminal papers that came out of Google 10 years ago. It has evolved into a project that provides enterprises with a reliable layer for storing massive amounts of unstructured data (HDFS) while allowing different computational frameworks to leverage those datasets.
The original computational framework (MapReduce) has evolved into a much more scalable set of general-purpose cluster management APIs collectively known as YARN. With YARN underneath, MapReduce is still there to support batch-oriented computations, but it is no longer the only game in town. With OpenMPI, Spark, and Tez rapidly becoming available, now is truly the most exciting time to be a developer in the Hadoop ecosystem. It is also a time when you don't have to be employed by Yahoo!, Facebook, or eBay to have access to mind-blowing compute power. That power is a credit card and a pivotal.io account away for anybody on the planet.
I will conclude by outlining some of the ongoing work that makes Hadoop and its ecosystem projects first class citizens in cloud environments based on the work that Pivotal engineers have done with integrating Hadoop into PivotalONE PaaS.
Kick-R: Get your own R instance with 36 cores on AWS - Kiwamu Okabe
The document describes a solution called Kick-R that runs R scripts on 36 cores on AWS to speed up processing. It demonstrates Kick-R running a random forest script on sample spam data more than 10 times faster on 36 AWS cores than locally on a laptop. Instructions are provided for building, running, and cleaning up the Kick-R environment.
AUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with Dask - Víctor Zabalza
# Talk given at PyCon UK 2017
The first step in any data-intensive project is understanding the available data. To this end, data scientists spend a significant part of their time carrying out data quality assessments and data exploration. In spite of this being a crucial step, it usually requires repeating a series of menial tasks before the data scientist gains an understanding of the dataset and can progress to the next steps in the project.
In this talk I will detail the inner workings of a Python package that we have built which automates this drudge work, enables efficient data exploration, and kickstarts data science projects. A summary is generated for each dataset, including:
- General information about the dataset, including data quality of each of the columns;
- Distribution of each of the columns through statistics and plots (histogram, CDF, KDE), optionally grouped by other categorical variables;
- 2D distribution between pairs of columns;
- Correlation coefficient matrix for all numerical columns.
Building this tool has provided a unique view into the full Python data stack, from the parallelised analysis of a dataframe within a Dask custom execution graph, to the interactive visualisation with Jupyter widgets and Plotly. During the talk, I will also introduce how Dask works, and demonstrate how to migrate data pipelines to take advantage of its scalable capabilities.
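The per-column summaries described above can be sketched in a dependency-free form (a toy illustration, not the package or Dask itself); the real tool computes the same kinds of statistics in parallel over a Dask graph.

```python
from statistics import mean, median

def summarise_column(name, values):
    """Report data quality and basic distribution stats for one column."""
    present = [v for v in values if v is not None]
    summary = {
        "column": name,
        "rows": len(values),
        "missing": len(values) - len(present),   # data-quality check
    }
    if present and all(isinstance(v, (int, float)) for v in present):
        summary.update(min=min(present), max=max(present),
                       mean=mean(present), median=median(present))
    else:
        summary["distinct"] = len(set(present))  # categorical column
    return summary

ages = [34, 29, None, 41, 29]
print(summarise_column("age", ages))
# {'column': 'age', 'rows': 5, 'missing': 1, 'min': 29, 'max': 41, 'mean': 33.25, 'median': 31.5}
```

Scaling this up is where Dask earns its keep: each column's statistics become tasks in an execution graph that run in parallel over partitioned data.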
In this session, we'll review the features and architecture of the new AWS Data Pipeline service and explain how you can use it to better manage your data-driven workloads. We'll then go over a few examples of setting up and provisioning a pipeline in the system.
Lens: Data exploration with Dask and Jupyter widgets - Víctor Zabalza
Lens is an open source Python library for automated data exploration of large datasets using Dask. It computes summary statistics and relationships between columns in a dataset. The results are serialized to JSON for interactive exploration through Jupyter widgets or a web UI. Dask allows the computations to run in parallel across a cluster for scalability. Lens integrates with the SherlockML platform to analyze all datasets uploaded.
Apache Spark: killer or savior of Apache Hadoop? - rhatr
The Big Boss(tm) has just OKed the first Hadoop cluster in the company. You are the guy in charge of analyzing petabytes of your company's valuable data using a combination of custom MapReduce jobs and SQL-on-Hadoop solutions. All of a sudden the web is full of articles telling you that Hadoop is dead, Spark has won and you should quit while you're still ahead. But should you?
"Who’s Afraid of Graphs?" by David Ostrovsky
Graphs are everywhere. Friended someone on Facebook? Graphs. Checked the best route to avoid traffic on Google Maps? Graphs. Those recruiters that keep spamming you with job offers on LinkedIn? They find you through graphs. We’re surrounded by problems that can be best represented and solved through graphs, and yet graph databases and processing frameworks remain an obscure niche accessible mainly to data scientists and academics. It’s time to right the injustice and bring graphs to the masses! This session is an introduction to Neo4j, OrientDB, GraphX, Giraph, and others.
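To see why these problems fit graphs so naturally, here is a minimal sketch (illustrative Python, not from the talk and not any of the listed databases) of the "best route" case: breadth-first search over an adjacency dict finds a shortest path.

```python
from collections import deque

def shortest_path(graph, start, goal):
    """BFS over an adjacency dict; returns one shortest list of nodes, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbour in graph.get(path[-1], []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None

# Hypothetical road network: the route through "jam" is longer.
roads = {
    "home": ["a", "b"],
    "a": ["jam"],
    "b": ["office"],
    "jam": ["office"],
}
print(shortest_path(roads, "home", "office"))  # ['home', 'b', 'office']
```

Graph databases and frameworks like the ones in this session generalize exactly this: the data is stored as nodes and edges, so traversals like BFS are first-class queries instead of expensive joins.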
1. Hadoop is a framework for distributed processing of large datasets across clusters of computers.
2. Hadoop can be used to perform tasks like large-scale sorting and data analysis faster than with traditional databases like MySQL.
3. Example applications of Hadoop include processing web server logs, managing user profiles for a large website, and performing machine learning on massive datasets.
Sorry - How Bieber broke Google Cloud at Spotify - Neville Li
Talk at Scala Up North Jul 21 2017
We will talk about Spotify's story with Scala big data and our journey to migrate our entire data infrastructure to Google Cloud and how Justin Bieber contributed to breaking it. We'll talk about Scio, a Scala API for Apache Beam and Google Cloud Dataflow, and the technology behind it, including macros, algebird, chill and shapeless. There'll also be a live coding demo.
This document provides an overview of Spark, including its history, use cases, architecture, and ecosystem. Some key points:
- Spark is an open-source cluster computing framework that allows processing of large datasets in parallel across compute clusters. It was developed at UC Berkeley in 2009 and became a top-level Apache project in 2013.
- Spark can be used for tasks like log analysis, text processing, analytics, search, and fraud detection on large datasets distributed across clusters. It offers APIs in Scala, Java, Python and can integrate with Hadoop ecosystem.
- Spark uses Resilient Distributed Datasets (RDDs) as its basic abstraction, allowing data to be processed in parallel. Transformations on RDDs are evaluated lazily, while actions trigger the actual computation.
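The transformation/action split can be mimicked with Python generators (a plain-Python analogy, not the PySpark API): transformations chain lazily and nothing runs until an action consumes the pipeline.

```python
log_lines = [
    "INFO startup complete",
    "ERROR disk full",
    "ERROR network timeout",
]

# "Transformations": lazily chained generators; nothing is computed yet.
errors = (line for line in log_lines if line.startswith("ERROR"))   # like filter()
messages = (line.split(" ", 1)[1] for line in errors)               # like map()

# "Action": forces the whole pipeline to run in one pass.
result = list(messages)
print(result)  # ['disk full', 'network timeout']
```

Spark applies the same idea at cluster scale: because the chain is lazy, the engine can see the whole pipeline before execution and plan how to distribute it.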
Slides from a talk at a meetup organized by SF Scala at Spotify's San Francisco office. The slides present details of playlist recommendations at Spotify and how Spotify uses Scalding to develop robust and reliable pipelines to generate these recommendations.
Meetup details: http://www.meetup.com/SF-Scala/events/224430674/
This document provides an overview of Hadoop architecture. It discusses how Hadoop uses MapReduce and HDFS to process and store large datasets reliably across commodity hardware. MapReduce allows distributed processing of data through mapping and reducing functions. HDFS provides a distributed file system that stores data reliably in blocks across nodes. The document outlines components like the NameNode, DataNodes and how Hadoop handles failures transparently at scale.
Apache Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It provides reliable storage through its distributed file system and scalable processing through its MapReduce programming model. Yahoo! uses Hadoop extensively for applications like log analysis, content optimization, and computational advertising, processing over 6 petabytes of data across 40,000 machines daily.
This document summarizes the agenda for an RDA bootcamp presented at the Vermont Library Conference on May 21, 2013. The presentation introduces Resource Description and Access (RDA), the new cataloging standard that replaces AACR2. The agenda includes an introduction to RDA and why it was developed, the basics of RDA description and access points, and how to implement RDA in libraries' catalogs. Presenters will discuss how RDA provides a better way to describe materials and gives catalogers more flexibility. While RDA makes some changes, the core of cataloging will remain similar, and hybrid catalogs combining RDA and AACR2 records are acceptable. Attendees are encouraged not to panic about the transition to RDA.
Frontera: a distributed robot for large-scale web crawling / Александр С... - Ontico
In this talk I will share our experience crawling the Spanish internet. We set ourselves the task of crawling about 600 thousand websites in the .es zone in order to collect statistics about hosts and their sizes. I will describe the crawler's architecture, the storage, the problems we ran into during the crawl, and how we solved them.
Our solution is available as the open source framework Frontera. The framework lets you build a distributed robot for downloading pages from the Internet at high volume in real time. It can also be used to build focused crawlers that fetch a subset of websites known in advance.
The framework offers: configurable storage for document URLs (RDBMS or key-value), crawl strategy management, a transport-layer abstraction, and a download-module abstraction.
The talk is structured in an engaging way: a description of the problem, the solution, and the issues that arose while building that solution.
- This presentation provides an overview of Cloudera Search, which brings Solr-based search capabilities to Hadoop.
- Key projects involved include Lucene, Solr, and Hadoop which can be integrated to allow indexing of data on HDFS and querying via search.
- The presentation discusses architectural details of running Solr on HDFS and integrating other Hadoop projects like HBase, MapReduce, and Hue.
This document contains the agenda for the Kansas City DevOps Meetup on December 5, 2012. The agenda includes presentations on Google Fiberspace and DevOps logistics by Aaron from Cerner and Stathy from OpsCode. It also discusses deciding on topics and volunteers for future meetups, with suggestions like infrastructure as code, continuous deployment, and experience sharing.
Bobby Evans and Tom Graves, the engineering leads for Spark and Storm development at Yahoo, will talk about how these technologies are used on Yahoo's grids and reasons to use one or the other.
Bobby Evans is the low latency data processing architect at Yahoo. He is a PMC member on many Apache projects including Storm, Hadoop, Spark, and Tez. His team is responsible for delivering Storm as a service to all of Yahoo and maintaining Spark on Yarn for Yahoo (Although Tom really does most of that work).
Tom Graves is a Senior Software Engineer on the Platform team at Yahoo. He is an Apache PMC member on Hadoop, Spark, and Tez. His team is responsible for delivering and maintaining Spark on Yarn for Yahoo.
The document discusses using Python with Hadoop frameworks. It introduces Hadoop Distributed File System (HDFS) and MapReduce, and how to use the mrjob library to write MapReduce jobs in Python. It also covers using Python with higher-level Hadoop frameworks like Pig, accessing HDFS with snakebite, and using Python clients for HBase and the PySpark API for the Spark framework. Key advantages discussed are Python's rich ecosystem and ability to access Hadoop frameworks.
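The MapReduce model those libraries wrap can be simulated in plain Python (a stand-alone sketch of the streaming-style flow; with mrjob the same mapper and reducer would be methods on an MRJob subclass):

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    """Emit (word, 1) for every word on the line."""
    for word in line.split():
        yield word.lower(), 1

def reducer(word, counts):
    """Sum the counts collected for one word."""
    yield word, sum(counts)

lines = ["Hadoop streams data", "Python streams Hadoop"]

# Map phase, followed by the shuffle/sort Hadoop performs between phases.
mapped = sorted(pair for line in lines for pair in mapper(line))

# Reduce phase: one reducer call per distinct key.
counts = dict(
    result
    for word, group in groupby(mapped, key=itemgetter(0))
    for result in reducer(word, (n for _, n in group))
)
print(counts)
```

Hadoop Streaming runs exactly this shape at scale, feeding lines to the mapper on stdin and sorted key/value pairs to the reducer, which is why Python slots into the framework so easily.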
In KDD2011, Vijay Narayanan (Yahoo!) and Milind Bhandarkar (Greenplum Labs, EMC) conducted a tutorial on "Modeling with Hadoop". This is the first half of the tutorial.
This document discusses Hive on Spark, which allows Apache Hive queries to run on Apache Spark. It provides background on Hive, Spark, and their limitations. Hive on Spark was developed by the Hive community to leverage Spark's more efficient execution while maintaining compatibility. Examples are given of how simple and join queries are translated from Hive operations to Spark transformations and actions. Improvements to Spark needed to better support Hive are also outlined. The author thanks contributors from various organizations working on Hive on Spark.
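The query-translation idea can be made concrete with a toy sketch (plain Python illustrating the shape, not Hive on Spark's actual planner output): a SQL join roughly becomes keying both sides, grouping by key, and emitting the cross product per key.

```python
from collections import defaultdict

# SELECT u.name, o.total FROM users u JOIN orders o ON u.id = o.user_id
users  = [(1, "ada"), (2, "bob")]      # (id, name)
orders = [(1, 30), (1, 12), (2, 7)]    # (user_id, total)

# Shuffle stage: bucket both relations by the join key.
buckets = defaultdict(lambda: ([], []))
for uid, name in users:
    buckets[uid][0].append(name)
for uid, total in orders:
    buckets[uid][1].append(total)

# Reduce stage: per key, emit every (name, total) pairing.
joined = [(name, total)
          for names, totals in buckets.values()
          for name in names
          for total in totals]
print(sorted(joined))  # [('ada', 12), ('ada', 30), ('bob', 7)]
```

In Hive on Spark the same decomposition happens under the hood: the Hive operator tree for the join is compiled into Spark shuffle and aggregation stages rather than MapReduce jobs.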
The document discusses the need for a W3C community group on RDF stream processing. It notes there is currently heterogeneity in RDF stream models, query languages, implementations, and operational semantics. The speaker proposes creating a W3C community group to better understand these differences, requirements, and potentially develop recommendations. The group's mission would be to define common models for producing, transmitting, and continuously querying RDF streams. The presentation provides examples of use cases and outlines a template for describing them to collect more cases to understand requirements.
The Evolution of Hadoop at Spotify - Through Failures and PainRafał Wojdyła
The quickest way to learn and evolve infrastructure is by encountering obstacles and being forced to overcome limitations that keep you inches away from project goals. At Spotify, we’ve encountered many of these obstacles and frustrations as we grew our Hadoop cluster from a few machines in an office closet aggregating played song events for financial reports, to our current 900 node cluster that plays a large role in many features that you see in our application today.
Two members of Spotify’s Hadoop ‘squad’ will weave in war stories, failures, frustrations and lessons learned to describe the Hadoop/Big Data architecture at Spotify and talk about how that architecture has evolved.
We’ll talk about how and why we use a number of tools, including Apache Falcon and Apache Bigtop to test changes; Apache Crunch, Scalding and Hive w/ Tez to build features and provide analytics; and Snakebite and Luigi, two in-house tools created to overcome common frustrations.
Cassandra Day SV 2014: Spark, Shark, and Apache CassandraDataStax Academy
Evan Chan from Ooyala presents on integrating Apache Spark and Apache Cassandra for interactive analytics. He discusses how Ooyala uses Cassandra for analytics and is becoming a major Spark user. The talk focuses on using Spark to generate dynamic queries over Cassandra data, as precomputing all possible aggregates is infeasible at Ooyala's scale. Chan describes Ooyala's architecture that uses Spark to generate materialized views from Cassandra for fast querying, and demonstrates running queries over a Spark/Cassandra dataset.
Music Personalization : Real time Platforms.Esh Vckay
1. The document discusses music personalization techniques at Spotify, including understanding users and music content, using collaborative filtering and latent vector models to make recommendations, and building real-time recommendation systems using Apache Storm.
2. It describes how Spotify uses machine learning techniques like matrix factorization and word2vec to generate latent vectors for users, songs, artists and playlists to measure similarity and make personalized recommendations at scale for its 75 million users.
3. The key challenges are processing huge amounts of data from 1 billion playlists and 1TB of logs daily to provide recommendations for each new user within 3 seconds and in real-time as listening behaviors change.
OCF.tw's talk about "Introduction to spark"Giivee The
在 OCF and OSSF 的邀請下分享一下 Spark
If you have any interest about 財團法人開放文化基金會(OCF) or 自由軟體鑄造場(OSSF)
Please check http://ocf.tw/ or http://www.openfoundry.org/
另外感謝 CLBC 的場地
如果你想到在一個良好的工作環境下工作
歡迎跟 CLBC 接洽 http://clbc.tw/
Analyze one year of radio station songs aired with Spark SQL, Spotify, and Da...Paul Leclercq
Paris Spark Meetup - May 2017
Video : https://www.youtube.com/watch?v=w5Zd-1wIJrU
AdHoc analysis of radio stations broadcasts stored in a parquet files with plain SQL, the dataframe API.
The aim was to notice radio stations habits, differences and if radio stations brainwashing is a thing
This talk's Databricks notebook can be found here : https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6937750999095841/3645330882010081/6197123402747553/latest.html
The document discusses a presentation about practical problem solving with Hadoop and Pig. It provides an agenda that covers introductions to Hadoop and Pig, including the Hadoop distributed file system, MapReduce, performance tuning, and examples. It discusses how Hadoop is used at Yahoo, including statistics on usage. It also provides examples of how Hadoop has been used for applications like log processing, search indexing, and machine learning.
This document discusses processing large graphs. It introduces graph processing with MapReduce and Apache Giraph. MapReduce algorithms for finding triangles and connected components in graphs are described. The limitations of MapReduce for graph processing are discussed. Alternative graph processing technologies including Neo4j, a graph database, are presented. A movie recommendation use case is demonstrated using Neo4j to find similar users and recommend unseen movies.
5 things cucumber is bad at by Richard LawrenceSkills Matter
This talk will look at 5 things Cucumber’s bad at, why that’s a good thing, and what it tells us about Cucumber’s sweet spot in a team’s toolkit.
Many times, when people complain about something Cucumber’s not good at, they’re unwittingly describing something Cucumber shouldn't be good at. They’re revealing that they don’t quite understand BDD and Cucumber’s role in it.
Cucumber is the world's most misunderstood collaboration tool and people need to hear this over and over again.
Patterns for slick database applicationsSkills Matter
Slick is Typesafe's open source database access library for Scala. It features a collection-style API, compact syntax, type-safe, compositional queries and explicit execution control. Community feedback helped us to identify common problems developers are facing when writing Slick applications. This talk suggests particular solutions to these problems. We will be looking at reducing boiler-plate, re-using code between queries, efficiently modeling object references and more.
Scala e xchange 2013 haoyi li on metascala a tiny diy jvmSkills Matter
Metascala is a tiny metacircular Java Virtual Machine (JVM) written in the Scala programming language. Metascala is barely 3000 lines of Scala, and is complete enough that it is able to interpret itself metacircularly. Being written in Scala and compiled to Java bytecode, the Metascala JVM requires a host JVM in order to run.
The goal of Metascala is to create a platform to experiment with the JVM: a 3000 line JVM written in Scala is probably much more approachable than the 1,000,000 lines of C/C++ which make up HotSpot, the standard implementation, and more amenable to implementing fun features like continuations, isolates or value classes. The 3000 lines of code gives you:
The bytecode interpreter, together with all the run-time data structures
A stack-machine to SSA register-machine bytecode translator
A custom heap, complete with a stop-the-world, copying garbage collector
Implementations of parts of the JVM's native interface
Although it is far from a complete implementation, Metascala already provides the ability to run untrusted bytecode securely (albeit slowly), since every operation which could potentially cause harm (including memory allocations and CPU usage) is virtualized and can be controlled. Ongoing work includes tightening of the security guarantees, improving compatibility and increasing performance.
ENJOYIN
Oscar reiken jr on our success at manheimSkills Matter
This document discusses test automation at Manheim, a wholesale auto auction company. It describes how test automation was implemented for three of Manheim's major applications: Ove.com, Simulcast, and Manheim.com. Regression testing times were reduced from over 160 hours to under 10 minutes for Ove.com and similar improvements for the other applications. This was achieved by converting test cases to Cucumber scenarios, prioritizing by business value, and implementing the tests in Ruby and Java using tools like Watir and Selenium. The automation allows running hundreds of tests in parallel and integration with a build pipeline.
Progressive f# tutorials nyc dmitry mozorov & jack pappas on code quotations ...Skills Matter
Code Quotations: Code-as-Data for F#
This tutorial will cover F# Code Quotations in-depth. You'll learn what Code Quotations are, how to use them, and where to apply them in your applications. We'll work through several real-world examples to highlight the important features -- and potential pitfalls -- of Code Quotations.
Cukeup nyc ian dees on elixir, erlang, and cucumberlSkills Matter
Elixir, Erlang, and Cucumberl
Elixir is a new Ruby-inspired programming language that uses the powerful concurrent machinery of Erlang behind the scenes. Cucumberl is a port of Cucumber to Erlang. Let's see what happens when we put them together.
In this talk, we'll discuss:
How Erlang's concurrency makes it easier to write robust programs
Elixir's approachable syntax
How to test Erlang and Elixir programs using Cucumberl
Attendees will walk away with a solid introduction to the principles of Erlang, and an appreciation of the way Elixir brings the joy of Ruby to the solidity of the Erlang runtime.
Cukeup nyc peter bell on getting started with cucumber.jsSkills Matter
Cukeup NYC. Peter Bell on Getting started with cucumber.js
Ever wished you could use cucumber in your javascript apps? In this talk we'll look at the current state of play of cucumber js, when you should and shouldn't use it, and how to get started writing your step definitions in javascript.
Agile testing & bdd e xchange nyc 2013 jeffrey davidson & lav pathak & sam ho...Skills Matter
In this engaging experience report, we will present 3 different views – Developer, Tester, Business Analyst – of implementing Acceptance Test Driven Development in a complex, data-driven domain. Hear how we used ATDD for building a ubiquitous language across the entire team, promoting faster feedback, and cultivating a culture where product owners were deeply invested in the quality of both every deliverable and the system as a whole.
Progressive f# tutorials nyc rachel reese & phil trelford on try f# from zero...Skills Matter
This document outlines the agenda for a workshop on trying F# for data science. The agenda includes introductions and setup, three sets of hands-on exercises - getting started in F#, financial applications, and data visualization - as well as breaks. It also covers type providers and their benefits, and concludes with a challenge problem and additional resources.
Progressive f# tutorials nyc don syme on keynote f# in the open source worldSkills Matter
F# is a powerful open-source language which Microsoft, other companies and the F# community all contribute to. In this talk, Don will discuss how the “F# space” has recently opened up significantly in interesting ways. F# now includes contributions that range from Cloud IDE platforms, Cloud Compute frameworks, Data interoperability components, Cross-platform execution, Try F#, MonoDevelop, and even Emacs editor integration with surprising tooling support, as well as the Visual F# tools from Microsoft and the broader NuGet package ecosystem. Don will also talk about some of the latest contributions from Microsoft Research, including new type provider components for F#, and describe how his team work with the Visual F# team and other teams around Microsoft. There will also be demos of some fun new stuff that’s been going on with F# at MSR and the community.
Agile testing & bdd e xchange nyc 2013 gojko adzic on bond villain guide to s...Skills Matter
Would you like to learn how to make your software testing practices more effective? And how to use your testing strategy to better capture and reflect customer requirements? Gojko Adzic takes a critical look at the effectiveness of current software testing practices and proposes strategies to make it much more effective.
Dmitry mozorov on code quotations code as-data for f#Skills Matter
This document summarizes code quotations in F# and their uses, including for meta-programming, code transformation, testing frameworks, type providers, and data binding. Key points include: code quotations allow treating code as data; F# supports full language quotations unlike C# expression trees; quotations enable composing and decomposing code; and quotations are essential for type providers to access and represent types from other sources. Examples are provided for constructing and splicing quotations, implementing type providers, and using quotations for GUI input validation and data binding.
The document appears to be notes from an acceptance testing session where examples are provided and expanded upon to test different scenarios. It starts with a story about Mary and her lamb and provides examples of adding different variables like location or characters. It then moves to testing a library book reservation system, providing different user types and expected outcomes. The document ends with two short poems used to discuss how word choice matters.
The document discusses the architecture and workflow of deploying applications on Cloud Foundry. It describes how the vmc command line tool is used to target an API endpoint, login, push an application, and shows the steps Cloud Foundry takes to validate the application package, stage it, find an available Diego Application Container instance to run it, and start the application.
The document discusses the concept of serendipity and how to increase serendipitous discoveries through database and system design. It defines serendipity as the occurrence of beneficial discoveries by chance and describes three steps to encourage serendipity: 1) remove isolation by increasing connections across semantic and contextual boundaries, 2) allow information to traverse multiple hops, and 3) weight and filter information based on relevance and user feedback. Graph databases are said to better support serendipity compared to relational databases by more easily facilitating these three steps.
Simon Peyton Jones: Managing parallelismSkills Matter
If you want to program a parallel computer, it obviously makes sense to start with a computational paradigm in which parallelism is the default (ie functional programming), rather than one in which computation is based on sequential flow of control (the imperative paradigm). And yet, and yet ... functional programmers have been singing this tune since the 1980s, but do not yet rule the world. In this talk I’ll say why I think parallelism is too complex a beast to be slain at one blow, and how we are going to be driven, willy-nilly, towards a world in which side effects are much more tightly controlled than now. I’ll sketch a whole range of ways of writing parallel program in a functional paradigm (implicit parallelism, transactional memory, data parallelism, DSLs for GPUs, distributed processes, etc, etc), illustrating with examples from the rapidly moving Haskell community, and identifying some of the challenges we need to tackle.
The document discusses big data and Hadoop. It notes that big data comes in terabytes and petabytes, sometimes generated daily. Hadoop is presented as a framework for distributed computing on large datasets using MapReduce. While Hadoop can store and process massive amounts of data across commodity servers, it was not designed for business intelligence requirements. The document proposes addressing this by adding data integration and transformation capabilities to Hadoop through tools like Pentaho Data Integration, to enable it to better meet the needs of big data analytics.
This document discusses different types of "magic" that can be done in Pentaho reporting including parameter magic, wizard magic, and query magic. Parameter magic allows parameters to control system settings like enabling server-side printing. Wizard magic involves features for customizing report outputs like summary fields and row layout. Query magic refers to functions that allow running queries and using results in reports or parameters, primarily for calculated default values. Demos are provided for each type of magic.
I went to_a_communications_workshop_and_they_tSkills Matter
The document discusses lessons from a communications workshop. It covers:
1. The benefits of continuous integration (CI) automation over manual processes, including peace of mind and high visibility.
2. An introduction to the Community Build Framework (CBF) which manages server configurations and automatically patches builds.
3. Types of tests that can be automated, including databases, ETL, user interfaces, and more.
This document summarizes Saiku, an open source business intelligence and analytics tool. Saiku provides a lightweight user interface using HTML and JavaScript with a separate Java server and RESTful JSON communication. It is 100% open source and easily integrates with other data sources like SAP BW, Microsoft Analysis Services, and Mondrian. The roadmap outlines upcoming releases in early 2011 that will add features like drill support, visualizations, and basic integration with SAP BW.
2. What is Last.fm?
A music community website, powered by scrobbling, that provides personalised radio.
We aggregate scrobbles. A single scrobble is the smallest unit of music attention data.
1 scrobble = (track, artist, timestamp).
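In Hive, that tuple maps naturally onto a table row. A minimal sketch of what such a table could look like (hypothetical schema; the column names match the queries later in this deck, but this is not Last.fm's actual table definition):

```sql
-- Illustrative only, not Last.fm's actual schema.
CREATE TABLE scrobbles (
  userid   BIGINT,
  trackid  BIGINT,
  artistid BIGINT,
  unixtime BIGINT  -- when the listen happened, as a Unix timestamp
)
PARTITIONED BY (insertdate STRING);  -- e.g. '2009-12-01'; lets Hive prune by date
```

Partitioning by date means queries that filter on `insertdate` only read the matching partitions instead of scanning every scrobble ever recorded.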
3. In numbers
• 40 million users visit the site every month
• 39 billion scrobbles (600 per second)
• 400k personalised radio stations per day
enter Hadoop...
4. Hadoop cluster
• 44 nodes
• 8 cores per node
• 16 GB RAM per node
• 4x 1TB 7200rpm disks per node
5. Hadoop: what is it good for?
• Charts
• Reporting
• Corrections
• Site stats / metrics
• Neighbours
• Recommendations
6. But wait, can you tell us about <stuff/>?
• How many?
• When?
• Where?
• Who?
• Why? Why not?
7. Ad hoc questions
• We get them all the time.
• Questions are good things, but answers take up time.
• We would typically write programs once, run once.
enter Hive...
8. What is Hive?
"Hive is a data warehouse infrastructure built on top of Hadoop."
You get an SQL-like language for queries.
Start queries from a shell, file, JDBC, or Thrift.
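For example, the same one-liner can be run interactively or non-interactively (an illustrative sketch; `count_scrobbles.hql` is a hypothetical script name):

```sql
-- Interactively from the Hive shell, or from the command line:
--   hive -e "SELECT count(1) FROM scrobbles;"
-- or from a script file:
--   hive -f count_scrobbles.hql
SELECT count(1) FROM scrobbles;
```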
13. Example:
SELECT artistid, insertdate, count(1)
FROM scrobbles
WHERE (trackid = 10019 OR trackid = 368575614)
AND insertdate >= '2009-12-01'
AND insertdate <= '2009-12-31'
GROUP BY artistid, insertdate
ORDER BY artistid, insertdate;
14. Example:
(Venn diagram: users that scrobble vs. users that use the radio; how much do they overlap?)
15. Example:
SELECT count(1) FROM scrobbles GROUP BY userid;
SELECT count(1) FROM radiologs GROUP BY userid;
SELECT count(1)
FROM radiologs r JOIN scrobbles s
  ON r.userid = s.userid
GROUP BY r.userid;
16. Example:
Consider a user's scrobbles and radio listens for just one track.
(Timeline diagram: the user's scrobbles and radio listens plotted over time, with the first scrobble marked.)
17. Example:
SELECT r.userid, r.trackid, count(1)
FROM
(
  SELECT userid, trackid, min(unixtime) AS unixtime
  FROM scrobbles GROUP BY userid, trackid
) s
JOIN radiologs r
ON r.userid = s.userid AND r.trackid = s.trackid
WHERE s.unixtime < r.unixtime
GROUP BY r.userid, r.trackid;
18. Other nice things about hive
• Joins are really really easy (most of the time).
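As an illustration, pulling human-readable artist names into the earlier chart query is one extra clause, assuming a hypothetical artists lookup table with artistid and name columns:

```sql
SELECT a.name, s.insertdate, count(1)
FROM scrobbles s
JOIN artists a ON s.artistid = a.artistid
WHERE s.insertdate >= '2009-12-01'
  AND s.insertdate <= '2009-12-31'
GROUP BY a.name, s.insertdate;
```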
19. Preparing a search index
(Diagram: the crowd, labels, the scrobble cloud, charts, corrections, and the catalogue all feed into Hive, which builds the artist, album, and track indexes served to Solr.)
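One way a Hive-to-Solr handoff can be sketched (illustrative only; the path and the artists/charts tables are hypothetical, not Last.fm's actual pipeline) is to write the joined result out to an HDFS directory that a Solr indexing job then consumes:

```sql
-- Materialise the data to be indexed; a downstream job feeds it to Solr.
INSERT OVERWRITE DIRECTORY '/exports/solr/artists'
SELECT a.artistid, a.name, c.playcount
FROM artists a
JOIN charts c ON a.artistid = c.artistid;
```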
20. Not so great
• No RecordIO support.
• Really huge joins can cause out-of-memory exceptions.
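When one side of the join fits in memory, a common workaround in Hive of this era is a map-side join hint, which holds the small table in memory on each mapper and avoids the reduce-side shuffle where large joins tend to blow up (illustrative; the artists table is hypothetical):

```sql
-- The MAPJOIN hint loads the small table (a) into memory on each mapper,
-- streaming the large scrobbles table past it instead of shuffling both.
SELECT /*+ MAPJOIN(a) */ a.name, count(1)
FROM scrobbles s
JOIN artists a ON s.artistid = a.artistid
GROUP BY a.name;
```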