This document discusses Last.fm's use of HFiles outside of HBase. It summarizes tests performed comparing Last.fm's original plain text file format to a new binary format based on HFiles. The HFile format reduced file size by 80% and query times by over 90%. Last.fm is moving its chartserver data storage to HBase to address indexing slowness and allow different teams to use different NoSQL systems. The document also advertises two open data scientist positions at Last.fm.
11. File Format
• Easy to grep / read from the command line.
• Server is easy to implement & maintain.
• Very fast thanks to the index. Very sparse though.
• Disk space is not really an issue here. We can always get rid of old indexes.
• Problem?
[Diagram: a sparse Index File (Key1 x Size, zero padding, … KeyN x Size) pointing into a Data File of sorted lines (Key1 Value 1 … KeyN Value 7)]
12. File Format
• Easy to grep / read from the command line.
• Server is easy to implement & maintain.
• Very fast thanks to the index. Very sparse though.
• Disk space is not really an issue here. We can always get rid of old indexes.
• Problem?
• It takes more time to generate the index than to create the Data File in Hadoop.
13. File Format
• Easy to grep / read from the command line.
• Server is easy to implement & maintain.
• Very fast thanks to the index. Very sparse though.
• Disk space is not really an issue here. We can always get rid of old indexes.
• Problem?
• It takes more time to generate the index than to create the Data File in Hadoop.
• Like... 6 times more.
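The index/data-file lookup above can be sketched in plain Java. This is a hypothetical reconstruction, not Last.fm's actual chartserver code: the index maps each key to the offset and byte length of its run of lines in the sorted data file, so a query is a single seek plus one read.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

public class IndexedTextLookup {
    // Offset and byte length of one key's run of lines in the data file.
    static final class Entry {
        final long offset;
        final int size;
        Entry(long offset, int size) { this.offset = offset; this.size = size; }
    }

    // Build the in-memory index (key -> offset + size), standing in for
    // the separate index file described on the slide.
    static Map<String, Entry> buildIndex(String data) {
        Map<String, Entry> index = new HashMap<>();
        long off = 0, start = 0;
        String currentKey = null;
        for (String line : data.split("\n")) {
            String key = line.split(" ")[0];
            if (!key.equals(currentKey)) {
                if (currentKey != null) {
                    index.put(currentKey, new Entry(start, (int) (off - start)));
                }
                currentKey = key;
                start = off;
            }
            off += line.length() + 1; // +1 for the '\n'
        }
        if (currentKey != null) {
            index.put(currentKey, new Entry(start, (int) (off - start)));
        }
        return index;
    }

    // Query: seek straight to the key's region and read only its bytes.
    static String lookup(Path file, Map<String, Entry> index, String key) throws IOException {
        Entry e = index.get(key);
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(e.offset);
            byte[] buf = new byte[e.size];
            raf.readFully(buf);
            return new String(buf, StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical chart data: sorted, one "key value" pair per line.
        String data = "key1 10\nkey1 20\nkey2 30\nkey3 40\nkey3 50\n";
        Path file = Files.createTempFile("chart", ".txt");
        Files.write(file, data.getBytes(StandardCharsets.UTF_8));
        System.out.print(lookup(file, buildIndex(data), "key2")); // prints "key2 30"
    }
}
```

The seek makes reads fast, which matches the slide's point: the cost is not lookup speed but building (and rebuilding) the index file after every Hadoop run.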
16. Requirements for the new file format:
• Binary:
– So it is smaller.
– Store Thrift-serialized data.
• Compression friendly.
• Self indexed:
– We do not want an index file anymore.
• Hadoop friendly:
– Generated in Hadoop; we don’t want to preprocess it before serving.
• Java/C++/Python friendly:
– These are the languages used in the Data and M.I.R. teams.
17. Requirements for the new file format:
• Binary:
– So it is smaller.
– Store Thrift-serialized data.
• Compression friendly.
• Self indexed:
– We do not want an index file anymore.
• Hadoop friendly:
– Generated in Hadoop; we don’t want to preprocess it before serving.
• Java/C++/Python friendly:
– These are the languages used in the Data and M.I.R. teams.
– Yeah, we still use C++.
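To make the "binary" requirement concrete, here is a minimal plain-Java sketch (no Thrift or Hadoop dependencies; names are illustrative) of the length-prefixed record layout a binary key-value format uses: each record carries KeyLen and ValLen up front, so a reader can skip or read a record without scanning for delimiters, and payloads can be arbitrary serialized bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class LengthPrefixedRecords {
    // Append one record: KeyLen (int), ValLen (int), Key, Value.
    static void append(DataOutputStream out, byte[] key, byte[] val) throws IOException {
        out.writeInt(key.length);
        out.writeInt(val.length);
        out.write(key);
        out.write(val);
    }

    // Read the value of the next record; the length prefixes let us
    // jump over the key without parsing it.
    static byte[] readValue(DataInputStream in) throws IOException {
        int keyLen = in.readInt();
        int valLen = in.readInt();
        in.skipBytes(keyLen);
        byte[] val = new byte[valLen];
        in.readFully(val);
        return val;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        append(out, "key1".getBytes(StandardCharsets.UTF_8),
                    "value1".getBytes(StandardCharsets.UTF_8));
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(buf.toByteArray()));
        System.out.println(new String(readValue(in), StandardCharsets.UTF_8)); // value1
    }
}
```

This is exactly the record shape the HFile data blocks use on the next slide: KeyLen (int), ValLen (int), Key (byte[]), Value (byte[]).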
18. HFile:
(diagram by Schubert Zhang, http://cloudepr.blogspot.com)
File layout:
• Data Block 0, 1, 2, … — each starts with DATA BLOCK MAGIC (8B), followed by key-values from first to last: KeyLen (int), ValLen (int), Key (byte[]), Value (byte[]).
• Meta Block 0, 1, … (optional) — user defined metadata, starts with METABLOCKMAGIC.
• File Info — entries of the form KeyLen (vint), Key (byte[]), id (1B), ValLen (vint), Val (byte[]): Size or ItemsNum (int), LASTKEY (byte[]), AVG_KEY_LEN (int), AVG_VALUE_LEN (int), COMPARATOR (className).
• Data Index — starts with INDEX BLOCK MAGIC (8B); one entry per data block: Offset (long), DataSize (int), KeyLen (vint), Key (byte[]).
• Meta Index (optional) — starts with INDEX BLOCK MAGIC (8B); one entry per meta block: Offset (long), MetaSize (int), MetaNameLen (vint), MetaName (byte[]).
• Fixed File Trailer.
• Based on Google’s SSTable (from Bigtable)
• Keys and Values are byte strings.
• Keys are ordered.
• Sequence of blocks.
• Block index loaded into memory.
• Can be queried with: hbase org.apache.hadoop.hbase.io.hfile.HFile
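The in-memory block index is what makes HFile reads cheap: it stores the first key of each data block, and a lookup binary-searches it to find the single block that could contain the key before touching disk. A self-contained plain-Java sketch of that floor search (a stand-in for the real HFile index, using Strings instead of byte[] keys):

```java
import java.util.Arrays;

public class BlockIndexSearch {
    // firstKeys holds the first key of each data block, in sorted order,
    // as the data index stores them. Returns the index of the block whose
    // key range could contain 'key', or -1 if the key sorts before the
    // first block (and so cannot be in the file).
    static int findBlock(String[] firstKeys, String key) {
        int pos = Arrays.binarySearch(firstKeys, key);
        if (pos >= 0) {
            return pos;              // key is exactly a block's first key
        }
        int insertion = -pos - 1;    // where the key would be inserted
        return insertion - 1;        // block whose first key precedes it
    }

    public static void main(String[] args) {
        String[] firstKeys = {"apple", "kiwi", "pear"};
        System.out.println(findBlock(firstKeys, "banana")); // 0: falls in block 0
        System.out.println(findBlock(firstKeys, "kiwi"));   // 1
        System.out.println(findBlock(firstKeys, "zebra"));  // 2: last block
        System.out.println(findBlock(firstKeys, "aaa"));    // -1: before all blocks
    }
}
```

Only the small index lives in memory; each query then reads (at most) one block, which is why the format stays fast even when the data files are large.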
19. HFile:
// create an HFile reader from a file.
HFile.Reader reader = new HFile.Reader(fs,
    filePath, new SimpleBlockCache(), true);
// load its info into memory.
reader.loadFileInfo();
// get a Scanner
HFileScanner scan = reader.getScanner(true, true);
// create the key we are interested in.
KeyValue kvKey = new KeyValue(Bytes.toBytes(key),
    Bytes.toBytes("f"), ...);
// check if the key is in the file.
if (0 != scan.seekTo(kvKey.getKey())) {
    log.error("Couldn't find the key");
} else {
    log.info("Value: " +
        scan.getKeyValue().getValue());
}
28. We are hiring! (http://www.last.fm/about/jobs)
Data Scientist
Purpose & Background of Role
We're seeking two top-notch data scientists with strong programming skills to join the
small and very enthusiastic data and recommendations team at Last.fm. These two
positions are full-time and based in London.
Are you a superb data analyst as well as a hands-on implementer who understands the
trade-offs of the memory hierarchy and is able to work around constraints in disk speed,
memory size and CPU cycles? Are you familiar with all common data structures and their
complexity? Do you take pride in being clever and solving difficult problems creatively?
Are you full of ideas and always looking for new ways of making use of data? Are you
an advocate for data-driven development and fully capable of conducting a proper A/B
test? Do you love music?
Requirements:
• Solid background in statistics and computer science
• Highly fluent in Python and either C++ or Java (or both)
• Comfortable with the Unix CLI and shell scripting
• Passion for machine learning and data visualisation
• Proficient with databases, both relational and non-relational
• Experience with Hadoop and analysing terabyte-scale datasets
• Familiar with data-driven development and split testing
• Basic understanding of common web technologies
• Track record in music information retrieval research is a plus
29. We are hiring! (http://www.last.fm/about/jobs)
(Same posting as the previous slide.) x 2