JDD2014: Real Big Data - Scott MacGregor

-The evolution of Big Data, both inside Akamai and in the industry.
-The current Big Data Ecosystem with real-world examples.
-Challenges in Big Data and future directions.

Published in: Software

  1. Real… Big… Data… and its constant evolution (Scott MacGregor)
  2. Who is this guy?
  3. Akamai Big Data Infrastructure
      150,000 collector nodes
      5000 map/reduce nodes
      Billions of jobs per day
  4. What is Big Data?
  5. The V’s
  6. Data that is Big (from Hortonworks)
  7. What’s it really about?
  8. From the beginning…
      • Akamai needed a billing system and scalable monitoring
      • The Open Source community wanted a search engine
      • Yahoo needed better product analytics for page views
      • Google needed more scalable computation for ad management
      • Facebook needed real-time updates to social graph
      • LinkedIn needed a real-time activity data pipeline
      • Twitter needed hashtag and topic streams
      • Amazon needed durable shopping carts
      • Netflix needed a recommendation engine
  9. Big Data timeline
      Timeline figure with two tracks, Akamai and Industry. Years marked: 1998, 2001, 2003, 2005, 2006, 2007, 2008, 2010, 2011, 2012, 2013, 2014.
      Labels in extraction order (their positions on the timeline are not recoverable): Wide area, real-time, in-memory system monitoring; Generalized map/reduce on 1 machine; Decentralized job scheduling; Multiple machines; File System DB; Nutch; Google FS; Google MapReduce; Neo4J; Amazon Dynamo; Yahoo spins off Hadoop; NoSql; Geographical redundancy; Real-time reporting; Columnar DB; Distributed File System DB; Wide-area MapReduce; ExaByte Query; HBASE; LinkedIn Kafka; Facebook Cassandra; Twitter Storm; Facebook Presto
  10. How it works…
  11. Big Data modes
      • Batch
        – Computation over a large static data set
        – Results are complete
      • Online
        – Computation on data as it’s generated
        – Localized results, must be aggregated downstream
  12. Big Data primitives
      • Collection
      • Parsing
      • Partitioning
      • Filtering
      • Throttling
      • Aggregation
      • Tracking
      • Validation
      • Analysis
  13. Collection
      • What
        – Logs
        – Metadata
        – System stats
        – Application events
        – Application stats
        – Network data
      • How
        – Email
        – SPDY
        – HTTP POST
        – SCP
        – Scribe
        – Avro
        – Custom
  14. Parsing
      • Read lines or blocks and split into fields
      • Transform, e.g. protobuf
      • Map keys to values
      Example log line:
        S 1359487051.701 4200191766 2097152 2097918 61.252.169.21 GET 440 1423 itunesus011.download.akamai.com 200 - iPeV image/jpeg - - 44 3031 - - - - - W - us/r1000/011/Purple/53/e6/0f/mzl.slohufby.320x480-75.jpg - a440.phobos.apple.com
      Records shown after the log line on the slide:
        1359486900 1423 a440.phobos.apple.com 1 3158
        1359486900 1423 200 1 30128
        1359486900 1423 1 209158
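The "split into fields, map keys to values" step can be sketched in a few lines of Python. The slide does not label the columns of the log line, so every field name below is a guess made for illustration (only the first eleven whitespace-separated fields are named); this is a minimal sketch, not Akamai's parser.

```python
# Hypothetical column names: the slide shows the raw line but not its schema.
FIELD_NAMES = [
    "record_type", "timestamp", "field_2", "field_3", "field_4",
    "client_ip", "method", "field_7", "field_8", "host", "status",
]


def parse_line(line, names=FIELD_NAMES):
    """Split a whitespace-delimited log line and map names (keys) to values."""
    values = line.split()
    record = dict(zip(names, values))  # extra trailing fields are ignored
    # Transform selected fields from text to typed values.
    record["timestamp"] = float(record["timestamp"])
    record["status"] = int(record["status"])
    return record


sample = (
    "S 1359487051.701 4200191766 2097152 2097918 61.252.169.21 "
    "GET 440 1423 itunesus011.download.akamai.com 200"
)
record = parse_line(sample)
```

A real pipeline would take the schema from the log format definition rather than from a hard-coded list.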
  15. Partitioning
      • Bucketing
        – Reduce to a single record per bucket
        – e.g. 5 minutes, /24, etc.
      • Hashing
        – Bucket blocks or records of data by a hash function
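A minimal Python sketch of the two partitioning styles, assuming epoch-second timestamps and IPv4 addresses as in the parsing example. The function names and the choice of CRC32 as the hash are illustrative assumptions, not details from the talk.

```python
import ipaddress
import zlib


def time_bucket(timestamp, width_seconds=300):
    """Round an epoch timestamp down to the start of its bucket (default 5 minutes)."""
    seconds = int(timestamp)
    return seconds - seconds % width_seconds


def subnet_bucket(ip, prefix_len=24):
    """Map an IP address to its enclosing network, e.g. a /24."""
    return str(ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False))


def hash_partition(key, num_partitions):
    """Assign a key to a partition with a stable hash (CRC32 chosen for illustration)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


bucket = time_bucket(1359487051.701)     # 1359486900, the value seen in the slide's records
subnet = subnet_bucket("61.252.169.21")  # "61.252.169.0/24"
partition = hash_partition("a440.phobos.apple.com", 8)
```

A stable hash such as CRC32 is used instead of Python's built-in `hash()`, which is randomized per process for strings and would send the same key to different partitions on different machines.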
  16. Filtering
      • Statistical Methods
        – Top-k (HierarchicalCountSketch)
        – Set membership (Bloom filters)
        – Cardinality counting (HyperLogLog)
        – Frequency estimates (CountSketch)
        – Change detection (Deltoid)
      • Sampling
        – Random
        – Reservoir
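Of the methods listed, set membership with a Bloom filter is the easiest to show compactly. The sketch below is a generic textbook Bloom filter, not the implementation used at Akamai; the bit-array size, the number of hash functions, and the use of salted SHA-256 are all assumptions.

```python
import hashlib


class BloomFilter:
    """Probabilistic set: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


seen = BloomFilter()
seen.add("61.252.169.21")
known = "61.252.169.21" in seen  # True: added items are always found
```

The filter can say "possibly present" for an item that was never added, but it never misses one that was, which is why it suits cheap pre-filtering ahead of an exact lookup.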
  17. Throttling
      • Limit on cardinality per partition
        – Requires central management
        – Drop records over max
      • Remove or trim large fields
      Before:
        S 1359487051.701 4200191766 2097152 2097918 61.252.169.21 GET 440 1423 itunesus011.download.akamai.com 200 - iPeV image/jpeg - - 44 3031 - - - - - W - us/r1000/011/Purple/53/e6/0f/mzl.slohufby.320x480-75.jpg - a440.phobos.apple.com
      After trimming:
        S 1359487051.701 4200191766 2097152 2097918 61.252.169.21 GET 440 1423 ~ 200 - iPeV image/jpeg - - 44 3031 - - - - - W - ~
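Both throttling ideas can be sketched as follows. The per-partition limit is passed in as a plain argument here; in the system described it would come from central management, which this sketch does not model. The `~` placeholder follows the slide, while the 16-character threshold and the function names are assumptions.

```python
def throttle(records, max_keys_per_partition):
    """Keep records until a partition reaches its limit of distinct keys.

    records: iterable of (partition, key, value) tuples.
    Records that would introduce a new key beyond the limit are dropped.
    """
    keys_seen = {}
    kept = []
    for partition, key, value in records:
        keys = keys_seen.setdefault(partition, set())
        if key not in keys and len(keys) >= max_keys_per_partition:
            continue  # over the cardinality limit: drop the record
        keys.add(key)
        kept.append((partition, key, value))
    return kept


def trim_fields(fields, max_len=16, placeholder="~"):
    """Replace any field longer than max_len with a short placeholder."""
    return [f if len(f) <= max_len else placeholder for f in fields]


records = [("p1", "a", 1), ("p1", "b", 2), ("p1", "c", 3), ("p1", "a", 4), ("p2", "c", 5)]
kept = throttle(records, max_keys_per_partition=2)
trimmed = trim_fields(["GET", "itunesus011.download.akamai.com", "200"])
```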
  18. Aggregation
      • Merge
        – Merge-sort blocks in a partition
      • Reduce
        – Combine values for like keys
        – Sum, Min, Max, Mask, etc.
      • Shuffle
        – Move the data to where it’s needed or closer to like data
      Example input:
        1359486900 1423 1 209158
        1359486900 1423 1 209158
        1359529800 1423 1 209158
      After Aggregate:
        1359486900 1423 2 418316
        1359529800 1423 1 209158
      After Shuffle:
        {1423, 1359486900} 2 418316
        {1423, 1359529800} 1 209158
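The slide's worked example can be reproduced with a short Python sketch. The slide does not name the four columns; the code treats them as (time bucket, id, count, bytes), which is an assumption, and sums the last two for rows that share the first two.

```python
from collections import defaultdict


def aggregate(records):
    """Reduce step: sum the value columns of records with the same key.

    records: iterable of (bucket, ident, count, nbytes) tuples
    (column meanings are assumed, not stated on the slide).
    """
    totals = defaultdict(lambda: [0, 0])
    for bucket, ident, count, nbytes in records:
        totals[(bucket, ident)][0] += count
        totals[(bucket, ident)][1] += nbytes
    return [(b, i, c, n) for (b, i), (c, n) in sorted(totals.items())]


def shuffle(rows):
    """Re-key aggregated rows as {ident, bucket} so like data lands together."""
    return {(ident, bucket): (count, nbytes) for bucket, ident, count, nbytes in rows}


records = [
    (1359486900, 1423, 1, 209158),
    (1359486900, 1423, 1, 209158),
    (1359529800, 1423, 1, 209158),
]
aggregated = aggregate(records)
shuffled = shuffle(aggregated)
```

Because Sum is associative and commutative, the same reduce can run on partial results at each hop before a final merge, which is what lets online mode produce localized results that are aggregated downstream.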
  19. Tracking
      • Tracking
        – Embed GUID in each data unit sent
        – Publish GUIDs independent from data flow
        – Completeness is expected (published GUIDs) vs. actual (embedded GUID)
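The completeness check amounts to a set comparison between the GUIDs published out of band and the GUIDs found embedded in the data that actually arrived. A minimal sketch, assuming both sides are available as collections of strings; the function name and return shape are illustrative.

```python
import uuid


def completeness(published_guids, received_guids):
    """Compare expected (published) GUIDs with actual (embedded) GUIDs.

    Returns (fraction of published units received, missing, unexpected).
    """
    published = set(published_guids)
    received = set(received_guids)
    missing = published - received
    unexpected = received - published
    ratio = len(published & received) / len(published) if published else 1.0
    return ratio, missing, unexpected


# Producer side: one GUID per data unit, also published on a separate channel.
published = [str(uuid.uuid4()) for _ in range(4)]
# Receiver side: suppose the last data unit was lost in transit.
received = published[:3]

ratio, missing, unexpected = completeness(published, received)
```

Here `ratio` is 0.75 and `missing` holds the GUID of the lost unit, which tells a downstream consumer that its results are not yet complete.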
  20. Data integrity
      • Watermark
        – Producer watermarks every n-lines with a crypto key
        – Receiver checks watermarks
      • Checksum
        – Block checksums
        – Line CRC
        – Etc.
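One way to realize "watermark every n lines with a crypto key" is to append an HMAC over each block of n lines, computed with a key shared by producer and receiver. The slide does not say how the watermark is computed, so the HMAC-SHA256 choice, the `#WM ` marker, and the function names below are assumptions. A per-line CRC is shown with `zlib.crc32`.

```python
import hashlib
import hmac
import zlib

WATERMARK_PREFIX = "#WM "  # hypothetical marker for watermark lines


def _mac(block, key):
    """HMAC-SHA256 over a block of lines, as a hex string."""
    return hmac.new(key, "\n".join(block).encode("utf-8"), hashlib.sha256).hexdigest()


def add_watermarks(lines, key, n):
    """Producer: emit a watermark line after every n data lines."""
    out, block = [], []
    for line in lines:
        out.append(line)
        block.append(line)
        if len(block) == n:
            out.append(WATERMARK_PREFIX + _mac(block, key))
            block = []
    return out


def check_watermarks(stream, key):
    """Receiver: recompute each watermark and compare in constant time."""
    block = []
    for line in stream:
        if line.startswith(WATERMARK_PREFIX):
            claimed = line[len(WATERMARK_PREFIX):].encode("utf-8")
            if not hmac.compare_digest(claimed, _mac(block, key).encode("utf-8")):
                return False
            block = []
        else:
            block.append(line)
    return True


def line_crc(line):
    """Per-line CRC32 for detecting accidental corruption."""
    return zlib.crc32(line.encode("utf-8"))


key = b"shared-secret"
stream = add_watermarks(["l1", "l2", "l3", "l4", "l5"], key, n=2)
ok = check_watermarks(stream, key)
```

The CRC only catches accidental damage; the keyed watermark is what lets the receiver detect lines altered or injected by someone without the key. In this sketch, trailing lines after the last watermark are not covered.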
  21. Analysis
      • Online
        – Precomputed reports
      • Batch
        – Spark Programs
        – Map/Reduce
        – Hive: HQL
        – SQL
  22. Big Data at Akamai
      • Billing and Reporting
      • System monitoring
      • Media Analytics
      • Security
      • Log archive
  23. Billing and reporting
      Diagram labels: Logs; Akamai Edge Networks and Products; Q; Parse; Pipelines; Aggregate; Shuffle; Split; Billing DB; Reporting; Customers; Internal Apps
      3 PB/day, doubles every year
      Parsing
      • splits lines into fields
      • maps keys to values per pipeline
      • each log generates many pipelines
      • each pipeline represents a streaming table
      Evolution
      • Logs were emailed (up to 1PB/day)
      • Now delivered via SPDY (3PB/day)
  24. System monitoring
      Diagram labels: Akamai Networks and Products; Client; SQL; Parser; TLA; Agg (three instances); Alert; Trend
      50M jobs/day
      TLA: top level aggregator pulls data from aggregators, which pull data from producers at the time of the request
      Producers rewrite data locally
      Evolution
      • Single machine memory for table joins
      • Future: distributed memory for table joins
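The pull model described here (nothing is pushed or stored centrally; each request walks down the tree to the producers) can be illustrated with a toy Python hierarchy. The class names, the row format, and the use of a predicate in place of SQL are all assumptions made for this sketch.

```python
class Producer:
    """Holds its current rows locally and returns them only when asked."""

    def __init__(self, rows):
        self.rows = rows

    def pull(self):
        return list(self.rows)


class Aggregator:
    """Pulls from its sources at request time and concatenates the results."""

    def __init__(self, sources):
        self.sources = sources

    def pull(self):
        rows = []
        for source in self.sources:
            rows.extend(source.pull())
        return rows


class TopLevelAggregator(Aggregator):
    """Entry point for clients; a predicate stands in for the SQL query."""

    def query(self, predicate):
        return [row for row in self.pull() if predicate(row)]


producers = [Producer([{"host": f"node{i}", "load": i * 0.5}]) for i in range(4)]
tla = TopLevelAggregator([Aggregator(producers[:2]), Aggregator(producers[2:])])
busy = tla.query(lambda row: row["load"] > 0.9)

# A producer rewrites its data locally; the next query sees the new value.
producers[0].rows[0]["load"] = 2.0
busy_after = tla.query(lambda row: row["load"] > 0.9)
```

Because data is fetched at the time of the request, results are always current, at the cost of query latency that grows with the depth and fan-out of the tree.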
  25. Media analytics
      Diagram labels: Pipelines; Akamai Products; Front end; Column Store; Index; Reporting (two instances); API / UI; Customers; Events
      • Indexes are recreated for each update
      • Supports insert and update
      • Reads are flexible and fast
      Evolution
      • Index now fingerprint to lower cost
      • HyperLogLog for uniqueness counting
  26. Security products
      Diagram labels: Akamai Edge Networks; Front end; Pipelines; HDFS; Events; Akamai Web Firewall; Map/Reduce; HBASE; Hive; Cloudera; Graphite; Operations Center; Reputation Scoring; Threat Analysis; Intelligence Reports; Risk Based Authentication; Payment Fraud; External Data (three instances)
      20 TB/day
      Evolution
      • Replacing HBASE with custom aggregator
      • Replacing Hive with custom SQL processor
  27. Log archive
      Diagram labels: Logs; Q; Archive; Parse; Log cache 10%; Client IP Sketch; Archive Index (10TB); Pipelines; HDFS; Spark; Spark SQL; Client Request; Archive Front End
      180 PB, 450 Trillion records; doubles every year
      Request path: cache first, then archive; get Index and/or CIP
      Archive is 90 data centers distributed over wide area; projected 1.2 EB in 3 years
      Evolution: Was flat file for index, now HDFS/Spark
  28. The Ecosystem
      Base layer: Hadoop / Yarn; HDFS
      • Script: Pig
      • SQL: Hive
      • NoSQL: HBASE
      • Stream: Kafka, Storm
      • Search: Solr
      • In-Mem: Spark
      • Integration: Flume, Avro
      • Operations: Ambari, Zookeeper, Oozie
      • Monitoring: Graphite
      • Sharing: Mesos
  29. Building a system
      If you need fast access to massive amounts of data where queries are constrained to an index (read optimized):
      • Start with HDFS or Cassandra
      • Add HBASE column store
      • Add Hive for SQL-like access
      • Add Pig for scripting
      Diagram labels: Hadoop / Yarn; HDFS; HBASE (Get, Put); Hive (Select *); Pig ({ … })
  30. Building a system
      If you need to search logs:
      • Start with HDFS
      • Add Flume for log data integration
      • Add Avro for data serialization
      • Add Solr for search
      Diagram labels: Hadoop / Yarn; HDFS; Solr (Search, e.g. Ip = 1.1.1.1); Flume Agent; Avro Sink; Flume Collector; Avro Source
  31. Building a system
      If you need flexible and shared access to unlimited amounts of data:
      • Start with HDFS or Cassandra
      • Add Hadoop for Map/Reduce or
      • Add Hive for SQL-like access or
      • Add Pig for scripting
      • Add Mesos for resource sharing
      • Add Ambari for cluster management and provisioning
      • Add map/reduce programs for business logic
      Diagram labels: Hadoop / Yarn; HDFS; Pig ({…}); Hive (Select *); Flume; Ambari; Mesos; Map/Reduce (Java { … })
  32. Building a system
      If you need fast, flexible access to in-memory data:
      • Start with HDFS
      • Add Spark
      • Add Spark SQL for SQL-like access or
      • Create Spark programs for other business logic
      Diagram labels: SparkSQL (Select * from); Spark; Hadoop / Yarn; HDFS; Spark Progs (Java { … })
  33. Building a system
      If you need real-time stream event processing:
      • Start with HDFS
      • Add Kafka for messaging and pub/sub
      • Add Storm for event processing
      • Develop Java Bolts for processing logic
      Diagram labels: Kafka; Storm; Bolts ({ … }); Hadoop / Yarn; HDFS
  34. Future at Akamai
      • 100x
        – Everything bigger and faster
        – Requires new R&D across many Big Data components
      • Scaling Big Data Eco across wide-area
      • Internet Security
      • Positive reputation scoring
      • Automatic DDoS mitigation
      • Low latency data collection
        – 2^53 unique keys, <1 minute latency
      • Support DevOps
        – Near real-time monitoring and control
  35. Thank You
