Big-Data Technology Innovation: Hadoop, Real-time, and Machine Learning
Andy Feng, VP Architecture, Yahoo!

Yahoo started developing big-data technology with Hadoop MapReduce and File System in 2006, and made it an Apache open source project in 2009. Since then, big data has become a major component of the global tech industry, and Yahoo is leading the way. In the past three years, Yahoo has been a leading contributor to Apache Storm for event processing, Apache HBase for distributed NoSQL stores, Apache Spark for faster processing, and Druid for sub-second analytics. We have created new open source projects such as Apache Omid for transactional support of NoSQL stores, Yahoo Data Sketches for approximate analytics, and Yahoo CaffeOnSpark for distributed deep learning.

In this talk, we walk through Yahoo use cases (search, advertising, personalization, and Flickr) where our big-data technologies are best exemplified. We explain how Yahoo leverages these technologies to perform real-time processing and advanced machine learning against 600 petabytes of data, and describe the system architecture of our heterogeneous clusters of 40,000 servers for supporting a variety of workloads. We provide an overview of open source technologies (Apache Storm, Apache HBase, Apache Omid, and Yahoo CaffeOnSpark) and our in-house technology for large-scale machine learning. We discuss how academic researchers and industry technologists can help advance big-data technologies further.

Bio
Dr. Andy Feng is a VP of Architecture at Yahoo, leading the architecture and design of big-data and machine-learning initiatives. He has architected major platforms for personalization, ad serving, NoSQL, and cloud infrastructure. Prior to Yahoo, he was a Chief Architect at Netscape/AOL, and a Principal Scientist at Xerox.
(1min – T 2min)
Yahoo has a long history of involvement with Hadoop and big data: from our initial work to open source Hadoop in Apache, to our effort stabilizing YARN at scale, to recent work with the Apache community to drive scale and utilization across Hadoop, the Hadoop stack, and Storm. I plan to give an overview of several areas that were all talks this year at Hadoop Summit: overcommit of hardware, migration to Tez to improve Pig/Hive job utilization, and Storm's Resource-Aware Scheduler to improve Storm hardware utilization, with the hope that you will be intrigued enough to attend the talks or go back and watch them after the summit.
TODOS: Work in reference to last Sumeet keynote? Work in reference to the YARN talk from 2-3 years ago and breaking the scale benchmark. Maybe mention community effort (find a way to mention Hortonworks on Tez, Twitter, others?)
We released the Magic View as part of the Flickr 4.0 release last April, and this is the most visible user-facing feature that exposes our image recognition capabilities. Our users can switch from the traditional timeline view of their photos to an experience where their photos are arranged according to 70 categories. For example, you can see here that landscape photos are sub-categorized into different types such as mountain, rock, or shore.
This is a great feature for serendipitous photo discovery. Most of us have thousands of photos that we don’t get to see very often but are emotionally very attached to, and these types of groupings help us re-discover photos.
Let’s start with WHY machine learning.
Search is one of the key applications for Yahoo. For a user’s search phrase, we construct a result page with organic content together with ads.
To generate the result page, we rank content based on its relevance to the query terms, match ads against the query, and predict the probability of an ad click. Several machine learning algorithms are applied in this process: decision trees, logistic regression, and neural networks.
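To make the click-prediction step concrete, here is a toy sketch of how logistic regression can map ad features to a click probability. The feature names and training data are invented for illustration and have nothing to do with Yahoo's actual serving models; a production system would use far richer features and a distributed trainer.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(examples, lr=0.1, epochs=200):
    """Fit weights by stochastic gradient descent on log loss."""
    n = len(examples[0][0])
    w = [0.0] * n
    b = 0.0
    for _ in range(epochs):
        for x, y in examples:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y  # gradient of log loss with respect to the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict_ctr(w, b, x):
    """Predicted probability that the ad is clicked."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Hypothetical features per ad impression: [query-ad relevance, historical CTR]
data = [([0.9, 0.8], 1), ([0.8, 0.7], 1), ([0.2, 0.1], 0), ([0.1, 0.2], 0)]
w, b = train_logistic(data)
```

Ads with higher predicted click probability can then be ranked higher or priced accordingly in the auction.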
(2 min – T 7 min) And, best yet, CaffeOnSpark was open sourced last month under the Apache 2.0 license.
(1 min – T 11 min) Certain classes of problems in big-data analytics don’t scale well because queries take too much time or too many resources: count distinct, most frequent, quantiles, etc. That’s where sketch algorithms come in: “good enough” approximate answers work great for interactivity (and real-time stream data). We have used Sketches successfully for several use cases, such as audience analytics and Flurry analytics for our Mobile Developer Suite. Sketches integrates really well with Druid for sub-second OLAP, where we have had many contributions recently, such as dimension joins, reliable pull-based real-time ingestion, and schema introspection. Sketches is now available in open source, and integrates well with Pig and Hive from the Hadoop ecosystem.
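To illustrate the idea behind approximate count-distinct, here is a minimal K-Minimum-Values (KMV) sketch. This is a teaching toy, not the Yahoo Data Sketches library itself (which uses more sophisticated Theta sketches with rigorous error bounds): it keeps only the k smallest hash values seen, and from the k-th smallest value extrapolates how many distinct items must have been hashed.

```python
import hashlib

class KMVSketch:
    """Toy K-Minimum-Values sketch for approximate count-distinct."""

    def __init__(self, k=256):
        self.k = k
        self.mins = []  # sorted k smallest hash values, each in [0, 1)

    def _hash(self, item):
        # Map the item to a pseudo-uniform value in [0, 1).
        h = hashlib.sha1(str(item).encode("utf-8")).hexdigest()
        return int(h, 16) / float(1 << 160)

    def update(self, item):
        v = self._hash(item)
        if v in self.mins:
            return  # duplicate item, already counted
        if len(self.mins) < self.k:
            self.mins.append(v)
            self.mins.sort()
        elif v < self.mins[-1]:
            self.mins[-1] = v
            self.mins.sort()

    def estimate(self):
        if len(self.mins) < self.k:
            return float(len(self.mins))  # exact while under capacity
        # For n uniform draws, the k-th smallest is about k / (n + 1),
        # so n is about (k - 1) / (k-th smallest value).
        return (self.k - 1) / self.mins[-1]

# 10,000 events from only 1,000 distinct users
sk = KMVSketch(k=64)
for i in range(10000):
    sk.update("user-%d" % (i % 1000))
```

The sketch uses constant memory (k values) no matter how many events stream through, which is exactly the trade-off that makes "good enough" answers cheap at scale.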
(2 min – T 10 min) In the absence of one, we established a real-world streaming benchmark; the code is available on Git. I am excited to tell you that most of these multi-tenancy, scale, and security changes are available in the community releases or are on their way to being released.
(1 min – T 12 min) HBase is another cornerstone technology that we rely on extensively. There are applications on HBase that need to bundle multiple read and write operations into a single unit of work, and that’s exactly where Omid comes in. With Omid, applications can execute transactions with ACID properties without worrying about performance and fault tolerance. Omid executes millions of transactions per day for our incremental content management platform for next-gen search and personalization products. And, I am pleased to say that the same technology is now available as a new Apache incubator project.
Applications that need to bundle multiple read and write operations on HBase into logically indivisible units of work can use Omid to execute transactions with ACID (Atomicity, Consistency, Isolation, Durability) properties, just as they would use transactions in the relational database world. Omid extends the HBase key-value access API with transaction semantics. It can be exercised either directly, or via higher-level data management APIs. For example, Apache Phoenix (SQL-on-top-of-HBase) might use Omid as its transaction management component.
• Development. Omid is backward-compatible with HBase APIs, making it developer friendly. Minimal extensions are introduced to enable transactions.
• Semantics. Omid implements the popular, well-established Snapshot Isolation (SI) consistency paradigm that is supported by major SQL and NoSQL technologies (for example, Percolator).
• Scalability. Omid provides a highly scalable, lock-free implementation of SI. To the best of our knowledge, it is the only open source NoSQL platform that can scale beyond 100K transactions per second.
• Reliability. Omid has a high-availability (HA) mode, in which the core service operates as a primary-backup process pair with automatic failover. The HA support has zero overhead on mainstream operation.
• Simplicity. Omid leverages the HBase infrastructure for managing its own metadata. It entails no additional services apart from those used by HBase.
• Track Record. Omid is already in use by very-large-scale production systems at Yahoo.
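To make the snapshot isolation semantics concrete, here is a conceptual toy model of SI in plain Python. This is not Omid's actual API (Omid is a Java system layered on HBase with a central timestamp oracle); it only illustrates the two guarantees described above: reads see a consistent snapshot as of the transaction's start, and a write-write conflict causes the later transaction to abort (first-committer-wins).

```python
class SIStore:
    """Toy multi-version store illustrating snapshot isolation."""

    def __init__(self):
        self.ts = 0         # logical clock (Omid uses a timestamp oracle)
        self.versions = {}  # key -> list of (commit_ts, value), ascending

    def begin(self):
        self.ts += 1
        return {"start_ts": self.ts, "writes": {}}

    def read(self, tx, key):
        if key in tx["writes"]:              # read-your-own-writes
            return tx["writes"][key]
        for commit_ts, value in reversed(self.versions.get(key, [])):
            if commit_ts <= tx["start_ts"]:  # latest version in the snapshot
                return value
        return None

    def write(self, tx, key, value):
        tx["writes"][key] = value            # buffered until commit

    def commit(self, tx):
        # Abort if any written key was committed after our snapshot.
        for key in tx["writes"]:
            for commit_ts, _ in self.versions.get(key, []):
                if commit_ts > tx["start_ts"]:
                    return False             # write-write conflict: abort
        self.ts += 1
        for key, value in tx["writes"].items():
            self.versions.setdefault(key, []).append((self.ts, value))
        return True

store = SIStore()
t1 = store.begin()
t2 = store.begin()
store.write(t1, "profile:42", "a")
store.write(t2, "profile:42", "b")
committed_t1 = store.commit(t1)  # first writer wins
committed_t2 = store.commit(t2)  # conflicting writer aborts
```

In real Omid the conflict check and timestamp allocation are centralized and lock-free, which is what lets it scale past 100K transactions per second while keeping these same semantics.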
Big Data Day LA 2016 Keynote - Andy Feng / Yahoo
Hadoop, Real-time, and Machine Learning
Andy Feng
Yahoo
Big-Data Impact: Search2Vec

Bucket Tests       | Query   | Auction Depth | Revenue per Search
vs. Simple model   | +1.14%  | +2.13%        | +7.07%
vs. Small model    | +2.44%  | +2.39%        | +9.39% (= +17.12% vs. Baseline)
CaffeOnSpark: Distributed Deep Learning
Data Sketches Algorithms Library
Sub-second User Facing Analytics
Apache Storm: Real-time Processing
Multi-Tenancy & Resource Awareness
Apache Omid: Transactions for NoSQL DB
• Multi-row/multi-table transactions
• Snapshot isolation