Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

How Spark Fits into Baidu's Scale-(James Peng, Baidu)


Published on

Presentation at Spark Summit 2015

Published in: Data & Analytics
  • Hello! Get Your Professional Job-Winning Resume Here - Check our website!
    Are you sure you want to  Yes  No
    Your message goes here

How Spark Fits into Baidu's Scale-(James Peng, Baidu)

  1. 1. How Spark Fits Baidu’s Scale Dr. James Peng Principal Architect, Baidu U.S.A.
  2. 2. Baidu is Leading Chinese Search Engine No. 1 Website in China By average daily visitors and page views No. 1 Chinese search engine By number of search question conducted No. 1 Most valuable brand Among private sector companies in China 73 % Search market share in China $73 bn. Market capitalization Premium global Internet company $2 bn. Q1 2015 total revenues By number of search question conducted 47,000 Employees Top 3 China’s best employer
  3. 3. Baidu’s Big Data Infrastructure
  4. 4. Baidu’s History in Big Data Infrastructure 2003 2008 2009 2010 2011 2012 2013 2014 Distributed Search System Hadoop in production Distributed webpage storage: storing over 100 billion web pages Large-scale machine learning platform > 10000 nodes Distributed computing engine in production Real-time streaming system with delay in milliseconds range Large-scale deployment of in-house developed network switcher and ARM servers Large scale DNN: supporting over 100 billion features
  5. 5. Largest MR Cluster 0 2000 4000 6000 8000 10000 12000 14000 2008 2009 2010 2011 2012 2013 MR Cluster Size Number of Nodes 0 200000 400000 600000 800000 1000000 1200000 2008 2009 2010 2011 2012 2013 MR Cluster Daily Load Number of Jobs
  6. 6. Baidu’s Distributed Shuffle Engine Champion of 2014 Sort Benchmark Indy 2014, 8.38 TB/min BaiduSort 100 TB in 716 seconds  982 nodes x  (2 2.10Ghz Intel Xeon E5-2450, 192 GB memory, 8x3TB 7200 RPM SATA)  Dasheng Jiang  Baidu Inc. and Peking University
  7. 7. Baidu Proprietary Platforms DCE/DAG •  DAG model Distributed Computing Engines •  Outperforms open-source MR by at least 30% DSTREAM •  Distributed real- time computation engine •  99.99% of tasks with latency less than 50 ms TM •  mini-batch execution engine •  Transaction- based, guarantee exactly once delivery PADDLE •  PArallel Distributed Deep Learning Engine •  With CPU / GPU support and suitable for vision, speech, and NLP workloads ELF •  Essential Learning Framework •  Dataflow- based programming model •  Parameter server able to store > 1 trillion parameters THERE’S STRONG, AND THEN THERE’S BAIDU STRONG
  8. 8. A Spark in Baidu
  9. 9. •  Scale: > 1000 nodes(20000 Cores, 100 TB Memory) •  Daily Jobs:2000~3000 •  Supported Business Units: Ads, Search, Map, Commerce, etc Initial Evaluation Spark 0.8 Internal Beta Usage Spark 0.9 Beta Cluster with 200 nodes Spark 1.0 Incubation Integrate into Baidu Open Cloud Spark 1.1 Spark SQL in Production Spark 1.2 Continue Scaling Spark 1.3 Production and Scaling Spark’s History in Baidu
  10. 10. Spark in Baidu Internal Infrastructure SparkDSTREAM TM DCE/DAG ELF Compute Layer Baidu Data Centers + Resource Management + Resource Scheduling Resource Layer Dataflow Compiler Java C++ Python Scala SQL Programming Layer
  11. 11. Spark in Baidu Open Cloud Baidu Cloud Cluster HDFS Map Reduce Baidu Object Storage HBase Pig Hive SQL Streaming GraphX MLLib BMR Console BMR SDK Audit Scheduler Management
  12. 12. Enabling Interactive Queries with Spark and Tachyon
  13. 13. Need for Interaction (Speed) Problem •  John is a PM and he needs to keep track of the top queries submitted to Baidu everyday •  Based on the top queries of the day, he will perform additional analysis •  But John is very frustrated that each query takes tens of minutes Requirements for Baidu Interactive Query Engine •  Manages Petabytes of data •  Able to finish 95% of queries within 30 seconds
  14. 14. Query UI Data Warehouse Baidu File System Storage Layer Tachyon  Hot Data Layer Baidu Interactive Query Architecture Compute Layer Meta Store HiveQL Spark SQL UDFs SerDes Apache Spark
  15. 15. Baidu Interactive Query Performance > 50X Acceleration of Big Data Analytics Workloads 0 200 400 600 800 1000 1200 MR (sec) Spark (sec) Spark + Tachyon (sec) Setup: 1.  Use MR to query 6 TB of data 2.  Use Spark to query 6 TB of data 3.  Use Spark + Tachyon to query 6 TB of data
  16. 16. Summary Spark and Tachyon provides significant performance advantage •  Suitable for latency sensitive workloads, such as interactive ad- hoc query Takes time for Spark to reach MR scale •  We still use MR for high throughput batch workloads •  Deployment Scale: Spark 1000 nodes vs. MR 13000 nodes •  Memory Management is one of the main blockers of scalability Look forward to Spark Improvements •  Project Tungsten: better memory management: main blocker of scalability •  Continuous improvement of Catalyst •  Support for multiple meta stores
  17. 17. Looking Into the Future Spark + TachyonSystems Software FPGA + GPU Acceleration Hardware Faster Network Spark SQL Applications DNN Spark R Spark MlLib Speed Functionality