Big Data: A New Frontier
Alex Cheng, VP Baidu
2013-4-12
Every day at Baidu:
5 billion+ Search Queries
~4 million Posts on PostBar
~500 million Users
100 million+ Mobile Search Users
~500,000 Business Clients
Storage
Processing
Analytics & Prediction
Data Intelligence
Volume
Velocity
Variety
Value
Web Pages & Links: 100+ PB
Logs: 100+ PB
UGC: 1 PB
Web, News, PostBar, Encyclopedia, Knows
Searches, Clicks, Posts etc.
1 petabyte = 2x National Library of China
Logs: 100+ PB
UGC: 1+ PB
[Chart: data volume by year, 2005 to 2012, drawn in units of 100 PB]
•  95% of the data was created within the last 3 years
•  100 PB of new data is processed every day
Growth: 100%+ YoY
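A quick arithmetic check relates the two figures on this slide. The sketch below assumes (this is not stated on the slide) that "Growth" refers to the total stored volume and that the rate is constant from year to year. Under that assumption, the share of today's data created in the last n years is 1 - (1 + g)^-n. The function names are illustrative only.

```python
def share_created_recently(yoy_growth: float, years: int) -> float:
    """Fraction of today's data stock created in the last `years` years,
    assuming the total stock grows by a constant `yoy_growth` each year."""
    return 1.0 - (1.0 + yoy_growth) ** (-years)


def growth_needed(share: float, years: int) -> float:
    """Constant yearly growth rate at which `share` of today's data stock
    was created in the last `years` years (inverse of the function above)."""
    return (1.0 / (1.0 - share)) ** (1.0 / years) - 1.0


if __name__ == "__main__":
    # Exactly 100% YoY (doubling every year) over 3 years
    print(f"{share_created_recently(1.0, 3):.1%}")
    # Growth rate that would put 95% of the data in the last 3 years
    print(f"{growth_needed(0.95, 3):.0%}")
```

Under these assumptions, doubling every year puts 87.5% of the data in the last three years, and the 95% figure corresponds to roughly 170% yearly growth, which is consistent with the slide's "100%+".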
  
Hardware
Innovations
•  Custom ARM-based
Servers
•  Gigabit Switches
•  Custom SSD/Flash
Storage
TCO -25%
Density +70%
PUE 1.18 / 1.37 (#1)
Non-cooling hours 48%
Custom Rack
Uptime Efficiency 10x
Performance 2x
Cost -48%
Baidu Cloud IDC
Yangquan, Shanxi, China
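For readers unfamiliar with the PUE figure quoted above: Power Usage Effectiveness is total facility energy divided by the energy delivered to IT equipment, so 1.0 would mean no overhead at all. The sketch below only converts a PUE value into overhead terms. The slide does not say what the two values (1.18 and 1.37) each refer to, so the code treats them as two plain inputs; the function names are illustrative.

```python
def overhead_per_it_unit(pue: float) -> float:
    """Energy spent on cooling, power conversion, etc. per unit of IT energy."""
    return pue - 1.0


def it_share_of_total(pue: float) -> float:
    """Fraction of total facility energy that reaches the IT equipment."""
    return 1.0 / pue


if __name__ == "__main__":
    for pue in (1.18, 1.37):
        print(
            f"PUE {pue}: overhead {overhead_per_it_unit(pue):.2f} per IT unit, "
            f"IT share {it_share_of_total(pue):.1%}"
        )
```

At a PUE of 1.18, about 85% of the facility's energy reaches the servers; at 1.37, about 73%.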
Software
Innovations
•  Global Optimization
•  Multiple Replication
•  Data Distribution
•  Partial Update
MONOLITHIC HW
TRADITIONAL
RELATIONAL
DATABASE
DIRECT RECORD ACCESS OR QUERIES
TRADITIONAL SERVER STACK
MAPREDUCE
NOSQL
DATABASE
PARALLEL
RELATIONAL
DATABASE
HADOOP
DISTRIBUTED HARDWARE
NEW SERVER STACK
•  Real-time online learning
•  Tens of billions of training samples
•  Billions of complex features
Feature
extraction
Model
Training
Models
Query
Advanced
Search
Module
CTR-server
Logs
Offline
Online
Big Data + Web Search
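To make "real-time online learning" for click-through-rate (CTR) prediction concrete, here is a minimal sketch of one common approach: logistic regression updated by stochastic gradient descent, one logged impression at a time. This is an illustration only, not a description of Baidu's system. The class name, the feature strings, and the learning rate are all hypothetical.

```python
import math


class OnlineCTRModel:
    """Logistic regression over sparse binary features, trained one example at a time."""

    def __init__(self, learning_rate: float = 0.1):
        self.lr = learning_rate
        self.weights = {}  # feature name -> weight, created lazily

    def predict(self, features):
        """Estimated click probability for one impression."""
        score = sum(self.weights.get(f, 0.0) for f in features)
        score = max(-30.0, min(30.0, score))  # keep exp() in a safe range
        return 1.0 / (1.0 + math.exp(-score))

    def update(self, features, clicked: bool) -> float:
        """One SGD step on the log loss; returns the prediction made before the update."""
        p = self.predict(features)
        gradient = p - (1.0 if clicked else 0.0)
        for f in features:
            self.weights[f] = self.weights.get(f, 0.0) - self.lr * gradient
        return p


if __name__ == "__main__":
    model = OnlineCTRModel()
    # Hypothetical log stream: (extracted features, was the result clicked?)
    log_stream = [(["query=flights", "ad=42"], True), (["query=flights", "ad=7"], False)] * 200
    for features, clicked in log_stream:
        model.update(features, clicked)
    print(model.predict(["query=flights", "ad=42"]))
    print(model.predict(["query=flights", "ad=7"]))
```

Because each update touches only the features present in one example, the same loop can in principle run continuously over incoming logs, which is the offline/online split the diagram suggests.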
  
•  Real-time Dictionary Updates
•  Dynamic Result Modeling
•  High-frequency Inputs
Recommendation
Big Data + IME
User
Input
NLP
Module
Consolidated
Search Result
On-Device
Quick
Search
Cloud-based Dictionary
Device-based Dictionary
Output
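The diagram above combines a device-based and a cloud-based dictionary into one consolidated result. A minimal sketch of that idea follows, assuming each dictionary maps a candidate word to a relevance score and that the higher score wins when both sources know the word. The function name, the scoring rule, and the sample entries are hypothetical; the slide gives no such detail.

```python
def consolidate(prefix, device_dict, cloud_dict, limit=5):
    """Merge prefix matches from both dictionaries and rank them by score.

    Assumption: each dictionary maps word -> score, higher is better.
    """
    merged = {}
    for source in (device_dict, cloud_dict):
        for word, score in source.items():
            if word.startswith(prefix):
                merged[word] = max(score, merged.get(word, 0.0))
    # Highest score first; ties broken alphabetically for stable output
    return sorted(merged, key=lambda w: (-merged[w], w))[:limit]


if __name__ == "__main__":
    device = {"baidu": 0.9, "bai": 0.4}                    # small, always available
    cloud = {"baike": 0.7, "baidu": 0.5, "beijing": 0.8}   # large, frequently updated
    print(consolidate("bai", device, cloud))
```

In this sketch the on-device dictionary can answer alone when the network is unavailable (pass an empty cloud dictionary), while the cloud side supplies newer entries when it is reachable.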
Voice
Images
•  10+ Billion Training Examples
•  Heterogeneous Features
•  Intensive Computing
Deep Learning
The Future of Big Data
“Digital Universe”
[Chart: size of the digital universe in exabytes, 2009 to 2020; y-axis ticks at 10,000, 20,000, 30,000 and 40,000]
Machine-generated
Sensor Data
“Anytime,
Anywhere,
Any Devices”
Smartphone
Smart Home
Wearable Devices
Smart Car
… …
