György Balogh: Modern Big Data Technologies (SECWorld 2014)

György Balogh gave a presentation at the SECWorld 2014 conference about cutting-edge yet affordable Big Data technologies.



  1. 1. MODERN BIG DATA TECHNOLOGIES György Balogh LogDrill Ltd. SECWorld – 7 May 2014
  2. 2. AGENDA • What is Big Data? • Why do we have to talk about it? • Paradigm shift in information management • Technology and efficiency
  3. 3. WHAT IS BIG DATA? • Data volumes that cannot be handled by traditional solutions (e.g. relational databases) • More than 100 million data rows, typically multi-billion
  4. 4. GLOBAL RATE OF DATA PRODUCTION (PER SECOND) • 30 TB/sec (22000 films) • Digital media • 2 hours of YouTube video • Communication • 3000 business emails • 300000 SMS • Web • Half a million page views • Logs • Billions of entries
  5. 5. BIG DATA MARKET
  6. 6. HYPE OR REALITY?
  7. 7. WHY NOW? ● Long-term trends ○ The volume of stored data has doubled roughly every 40 months since the 1980s ○ Moore’s law: the number of transistors on integrated circuits doubles about every 18 months
  8. 8. DIFFERENT EXPONENTIAL TRENDS
  9. 9. HARD DRIVES IN 1991 AND 2012 ● 1991 ● 40 MB ● 3500 RPM ● 0.7 MB/sec ● full scan: 1 minute ● 2012 ● 4 TB (x 100,000) ● 7200 RPM ● 120 MB/sec (x 170) ● full scan: 8 hours (x 480)
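The full-scan figures follow directly from capacity divided by sequential throughput; a quick back-of-the-envelope check of the slide's (rounded) numbers:

```python
# Sanity check of the full-scan times quoted on the slide.
drives = {
    "1991": {"capacity_mb": 40, "throughput_mb_s": 0.7},
    "2012": {"capacity_mb": 4_000_000, "throughput_mb_s": 120},  # 4 TB
}

for year, d in drives.items():
    seconds = d["capacity_mb"] / d["throughput_mb_s"]
    print(f"{year}: full sequential scan ~ {seconds / 60:.0f} min ({seconds / 3600:.1f} h)")

# Capacity grew ~100,000x while sequential throughput grew only ~170x,
# so the time needed to read a whole drive grew by a factor of several hundred.
print("scan-time growth ~", round(100_000 / 170), "x")
```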
  10. 10. DATA ACCESS BECOMES THE SCARCE RESOURCE!
  11. 11. GOOGLE’S HARDWARE IN 1998
  12. 12. GOOGLE’S HARDWARE IN 2013 • 12 data centers worldwide • More than a million nodes • A data center costs $600 million to build • Oregon data center • 15,000 m² • the power of 30,000 homes
  13. 13. GOOGLE’S HARDWARE IN 2013 • Cheap commodity hardware • each has its own battery! • Modular data centers • Standard container • 1160 servers per container • Efficiency: 11% overhead (power transformation, cooling)
  14. 14. THE BIG DATA PARADIGM SHIFT
  15. 15. TECHNOLOGIES • Hadoop 2.0 • Google BigQuery • Cloudera Impala • Apache Spark
  16. 16. HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
  17. 17. HADOOP MAP REDUCE
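To make the MapReduce model concrete, here is a minimal word-count sketch in the Hadoop Streaming spirit (a toy illustration, not Hadoop's own API): the mapper emits key/value pairs, the framework sorts them by key, and the reducer folds each group into a result.

```python
#!/usr/bin/env python3
# Toy word count in the MapReduce style: map emits (word, 1),
# the shuffle phase sorts by key, and reduce sums each group.
from itertools import groupby

def mapper(lines):
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    # pairs must arrive sorted by key, as Hadoop guarantees between the phases
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    text = ["big data modern technologies", "big data at scale"]
    shuffled = sorted(mapper(text))      # stand-in for Hadoop's shuffle/sort
    for word, count in reducer(shuffled):
        print(word, count)
```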
  18. 18. HADOOP • Who uses Hadoop? • Facebook: 100 PB • Yahoo: 4000 nodes • More than half of the Fortune 50 companies! • History • Replica of the Google architecture (GFS, BigTable) in Java under the Apache licence • Hadoop 2.0 • Full High Availability • Advanced resource management (YARN)
  19. 19. GOOGLE BIGQUERY • SQL queries on terabytes of data in seconds • Data is distributed over thousands of nodes • Each node processes one part of the dataset • Thousands of nodes work for us for a few milliseconds
      select year, SUM(mother_age * record_weight) / SUM(record_weight) as age
      from publicdata:samples.natality
      where ever_born = 1
      group by year
      order by year;
  20. 20. GOOGLE BIG QUERY • SQL queries on terabytes of data in seconds • Data is distributed over thousands of nodes • Each node processes one part of the dataset • Thousands of nodes work for us for a few milliseconds
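For reference, the same query can be issued from a few lines of client code; a sketch using the google-cloud-bigquery Python package (assumed installed and authenticated; the query uses BigQuery's legacy SQL syntax, matching the slide):

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses application-default credentials
job_config = bigquery.QueryJobConfig(use_legacy_sql=True)  # the slide's query is legacy SQL

sql = """
select year, SUM(mother_age * record_weight) / SUM(record_weight) as age
from publicdata:samples.natality
where ever_born = 1
group by year
order by year
"""

for row in client.query(sql, job_config=job_config).result():
    print(row.year, row.age)
```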
  21. 21. CLOUDERA IMPALA • Same as BigQuery but on top of Hadoop • Standard SQL on Big Data • Terabytes of data can be analyzed interactively on a cluster costing about 10 million HUF • Scales to thousands of nodes • Under the hood • Run-time code generation with LLVM • Parquet format (column oriented)
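As a small illustration of why a column-oriented format such as Parquet helps analytical workloads, a sketch using the pyarrow package (an assumption; Impala itself is queried through SQL, not through this API):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A tiny table in the spirit of the deck's web-log examples.
table = pa.table({
    "minute": ["00:00", "00:01", "00:02"],
    "method": ["GET", "GET", "POST"],
    "status": [200, 200, 404],
    "bytes":  [22957, 43422, 4353],
})
pq.write_table(table, "requests.parquet")

# A columnar reader only touches the columns a query actually needs,
# which is where much of the I/O saving of Parquet comes from.
print(pq.read_table("requests.parquet", columns=["status"]).to_pydict())
```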
  22. 22. APACHE SPARK • From UC Berkeley • Achieves up to a 100x speed-up compared to Hadoop MapReduce on certain tasks • In-memory computation across the cluster
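A minimal sketch of the in-memory idea with the PySpark API (assuming a local Spark installation; the log file path is a placeholder):

```python
from pyspark import SparkContext

sc = SparkContext("local", "cache-demo")

# Read once, keep the data in cluster memory, then reuse it for
# several computations without touching the disk again.
lines = sc.textFile("access.log").cache()              # "access.log" is a placeholder
total = lines.count()                                  # first action fills the cache
errors = lines.filter(lambda l: " 404 " in l).count()  # served from memory
print(total, errors)

sc.stop()
```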
  23. 23. INEFFICIENCY CAN WASTE A HUGE AMOUNT OF RESOURCES • A 300-node Hadoop + Hive cluster can be matched by a single node running Vectorwise • Vectorwise holds the world speed record for analytical database queries on a single node
  24. 24. CLEVER WAYS TO IMPROVE EFFICIENCY • Lossless data compression (even 50x!) • Clever lossy compression of data (e.g. OLAP cubes) • Cache-aware implementations (asymmetric trends, memory access bottleneck)
  25. 25. LOSSLESS DATA COMPRESSION • Compression can boost sequential data access by as much as 50x (100 MB/sec -> 5 GB/sec) • Less data -> fewer I/O operations • A single CPU can decompress data at up to 5 GB/sec • gzip decompression is comparatively slow • snappy, lzo and lz4 can reach 1 GB/sec decompression speed • the light-weight decompression used by column-oriented databases (e.g. PFOR) can reach 5 GB/sec • two billion integers per second (almost one integer per clock cycle!)
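A quick way to see these effects on your own machine, assuming the lz4 Python package is installed (a sketch; the absolute numbers depend heavily on the data and the CPU, and very repetitive data like the sample below compresses unrealistically well):

```python
import time
import lz4.frame  # pip install lz4

# Repetitive log-like data; real logs compress less, but still substantially.
data = b"2011-01-08 00:00:01 X1 Y1 1.2.3.4 GET /a/b/c 200 22957\n" * 200_000
compressed = lz4.frame.compress(data)
print(f"compression ratio: {len(data) / len(compressed):.0f}x")

start = time.perf_counter()
for _ in range(20):
    lz4.frame.decompress(compressed)
elapsed = time.perf_counter() - start
print(f"decompression throughput: {20 * len(data) / elapsed / 1e9:.1f} GB/s")
```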
  26. 26. EXAMPLE: LOGDRILL
      2011-01-08 00:00:01 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 22957 562
      2011-01-08 00:00:09 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 2957 321
      2011-01-08 00:01:04 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 43422 522
      2011-01-08 00:01:08 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 234 425
      2011-01-08 00:02:23 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 404 0 0 234 432
      2011-01-08 00:02:45 X1 Y1 1.2.3.4 POST /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 4353 134
      aggregated per minute, method and status code:
      2011-01-08 00:00 GET 200 2
      2011-01-08 00:01 GET 200 2
      2011-01-08 00:02 GET 404 1
      2011-01-08 00:02 POST 200 1
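The roll-up shown on this slide (request counts per minute, method and status code) can be sketched in a few lines; the field positions are assumed from the sample rows above:

```python
from collections import Counter

# A few of the raw rows from the slide.
raw = [
    "2011-01-08 00:00:01 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 22957 562",
    "2011-01-08 00:00:09 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 2957 321",
    "2011-01-08 00:02:45 X1 Y1 1.2.3.4 POST /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 4353 134",
]

counts = Counter()
for line in raw:
    fields = line.split()
    minute = fields[0] + " " + fields[1][:5]   # drop the seconds: "2011-01-08 00:00"
    method, status = fields[5], fields[11]
    counts[(minute, method, status)] += 1

for (minute, method, status), n in sorted(counts.items()):
    print(minute, method, status, n)
```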
  27. 27. CACHE-AWARE PROGRAMMING • CPU speed has been increasing by about 60% a year • Memory speed has been increasing by only 10% a year • The widening gap is bridged with multi-level cache memories • Caches are typically under-exploited; using them well can yield up to a 100x speed-up
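One simple way to observe the memory-access bottleneck, assuming numpy is available: reading the same array sequentially versus in a random order differs mainly in cache and prefetcher behaviour (the gap is usually severalfold rather than the full 100x, which takes more deliberate cache blocking):

```python
import time
import numpy as np

n = 20_000_000
data = np.arange(n, dtype=np.int64)
orders = {
    "sequential": np.arange(n),           # streams through memory, prefetch-friendly
    "random": np.random.permutation(n),   # a cache miss on nearly every access
}

for name, idx in orders.items():
    start = time.perf_counter()
    total = int(data[idx].sum())
    print(f"{name}: {time.perf_counter() - start:.2f}s (sum={total})")
```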
  28. 28. LESSONS LEARNED • Big Data is not just hype, at least from the technological viewpoint • Modern technologies (Impala, Spark) can reach the theoretical limits of the cluster hardware • A deep understanding of both the problem and the technologies is required to create efficient Big Data solutions
  29. 29. THANK YOU! Q&A?
