Gyorgy balogh modern_big_data_technologies_sec_world_2014
György Balogh gave a presentation at the SECWorld 2014 conference about cutting-edge yet affordable Big Data technologies.

Presentation Transcript

    • BIG DATA MODERN TECHNOLOGIES György Balogh LogDrill Ltd. SECWorld – 7 May 2014
    • AGENDA • What is Big Data? • Why do we have to talk about it? • Paradigm shift in information management • Technology and efficiency
    • WHAT IS BIG DATA? • Data volumes that cannot be handled by traditional solutions (e.g. relational databases) • More than 100 million rows, typically multiple billions
    • GLOBAL RATE OF DATA PRODUCTION (PER SECOND) • 30 TB/sec (22000 films) • Digital media • 2 hours of YouTube video • Communication • 3000 business emails • 300000 SMS • Web • Half a million page views • Logs • Billions of log lines
    • BIG DATA MARKET
    • HYPE OR REALITY?
    • WHY NOW? ● Long term trends ○ The amount of stored data has doubled every 40 months since the 1980s ○ Moore’s law: the number of transistors on integrated circuits doubles every 18 months
    • DIFFERENT EXPONENTIAL TRENDS
    • HARD DRIVES IN 1991 AND 2012 ● 1991 ● 40 MB ● 3500 RPM ● 0.7 MB/sec ● full scan: 1 minute ● 2012 ● 4 TB (x 100000) ● 7200 RPM ● 120 MB/sec (x 170) ● full scan: 8 hours (x 480)
    • DATA ACCESS BECOMES THE SCARCE RESOURCE!
    • GOOGLE’S HARDWARE IN 1998
    • GOOGLE’S HARDWARE IN 2013 • 12 data centers worldwide • More than a million nodes • A data center costs $600 million to build • Oregon data center • 15000 m2 • power of 30 000 homes
    • GOOGLE’S HARDWARE IN 2013 • Cheap commodity hardware • each server has its own battery! • Modular data centers • Standard container • 1160 servers per container • Efficiency: 11% overhead (power conversion, cooling)
    • THE BIG DATA PARADIGM SHIFT
    • TECHNOLOGIES • Hadoop 2.0 • Google BigQuery • Cloudera Impala • Apache Spark
    • HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
    • HADOOP MAP REDUCE
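    To make the MapReduce model concrete, here is a minimal word-count sketch using Hadoop Streaming, which lets the mapper and reducer be plain scripts that read stdin and write stdout; the script names and paths are illustrative, not from the slides.

      # mapper.py -- emits one "word<TAB>1" line per input word
      import sys
      for line in sys.stdin:
          for word in line.split():
              print(word + "\t1")

      # reducer.py -- Hadoop sorts the mapper output by key before the
      # reduce phase, so all counts for a word arrive consecutively and
      # can be summed in a single pass
      import sys
      current, total = None, 0
      for line in sys.stdin:
          word, n = line.rstrip("\n").split("\t")
          if word != current:
              if current is not None:
                  print(current + "\t" + str(total))
              current, total = word, 0
          total += int(n)
      if current is not None:
          print(current + "\t" + str(total))

    The job would be launched with the streaming jar, e.g. hadoop jar hadoop-streaming.jar -input /logs -output /counts -mapper mapper.py -reducer reducer.py (paths illustrative).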
    • HADOOP • Who uses Hadoop? • Facebook: 100 PB • Yahoo: 4000 nodes • More than half of Fortune 50 companies! • History • Replica of the Google architecture (GFS, BigTable) in Java under the Apache license • Hadoop 2.0 • Full High Availability • Advanced resource management (YARN)
    • GOOGLE BIGQUERY • SQL queries on terabytes of data in seconds • Data is distributed over thousands of nodes • Each node processes one part of the dataset • Thousands of nodes work for us for a few milliseconds
      select year, SUM(mother_age * record_weight) / SUM(record_weight) as age
      from publicdata:samples.natality
      where ever_born = 1
      group by year
      order by year;
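    As a usage sketch (not part of the slides): the query above can be submitted from Python with Google's google-cloud-bigquery client library. The table reference uses the legacy SQL syntax shown on the slide, so legacy SQL mode is enabled; credentials and project configuration are assumed to be set up.

      from google.cloud import bigquery

      client = bigquery.Client()  # assumes default credentials are configured
      sql = """select year,
                      SUM(mother_age * record_weight) / SUM(record_weight) as age
               from publicdata:samples.natality
               where ever_born = 1
               group by year order by year"""
      job = client.query(sql, job_config=bigquery.QueryJobConfig(use_legacy_sql=True))
      for row in job.result():  # blocks until the distributed query finishes
          print(row.year, row.age)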
    • CLOUDERA IMPALA • Same as BigQuery, but on top of Hadoop • Standard SQL on Big Data • Terabytes of data can be analyzed interactively on a 10 million forint cluster • Scales to thousands of nodes • Technology highlights • Run-time code generation with LLVM • Parquet format (column oriented)
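    A small illustration (a sketch with the pyarrow library, not Impala itself) of why the column-oriented Parquet format helps analytics: a query that touches one column reads only that column's bytes from disk.

      import pyarrow as pa
      import pyarrow.parquet as pq

      # A tiny table of hypothetical web requests
      table = pa.table({
          "method": ["GET", "GET", "POST"],
          "status": [200, 404, 200],
          "bytes":  [22957, 234, 4353],
      })
      pq.write_table(table, "requests.parquet")

      # Only the 'status' column is read; 'method' and 'bytes' stay on disk.
      statuses = pq.read_table("requests.parquet", columns=["status"])
      print(statuses.column("status").to_pylist())  # [200, 404, 200]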
    • APACHE SPARK • From UC Berkeley • Achieves a 100x speed-up over Hadoop on certain tasks • In-memory computation across the cluster
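    A minimal PySpark sketch of the in-memory model (the HDFS path is hypothetical): cache() keeps the filtered dataset in cluster RAM, so the second action avoids re-reading and re-parsing the input, which is where the speed-up on iterative workloads comes from.

      from pyspark import SparkContext

      sc = SparkContext(appName="cache-demo")
      logs = sc.textFile("hdfs:///logs/access.log")  # hypothetical path
      errors = logs.filter(lambda line: " 404 " in line).cache()

      print(errors.count())  # first action: reads and parses the file
      print(errors.take(5))  # second action: served from cluster memory
      sc.stop()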
    • INEFFICIENCY CAN WASTE HUGE AMOUNTS OF RESOURCES • A 300-node Hadoop + Hive cluster can be matched by a single Vectorwise node • Vectorwise holds the world speed record for analytical database queries on a single node
    • CLEVER WAYS TO IMPROVE EFFICIENCY • Lossless data compression (up to 50x!) • Clever lossy compression of data (e.g. OLAP cubes) • Cache-aware implementations (asymmetric trends, memory access is the bottleneck)
    • LOSSLESS DATA COMPRESSION • Compression can boost sequential data access up to 50 times! (100 MB/sec -> 5 GB/sec) • Less data -> fewer I/O operations • One CPU can decompress data at up to 5 GB/sec • gzip decompression is very slow • snappy, lzo and lz4 can reach 1 GB/sec decompression speed • the decompression used by column-oriented databases can reach 5 GB/sec (PFOR) • two billion integers per second (almost one integer per clock cycle!)
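    A back-of-the-envelope check of these figures, sketched with only Python's standard-library zlib on artificially repetitive, log-like input; real logs and the faster codecs (snappy, lz4) will behave differently.

      import time
      import zlib

      raw = b"2011-01-08 00:00:01 GET /a/b/c 200\n" * 2_000_000  # ~70 MB
      blob = zlib.compress(raw)
      print(f"compression ratio: {len(raw) / len(blob):.0f}x")

      start = time.perf_counter()
      out = zlib.decompress(blob)
      elapsed = time.perf_counter() - start
      print(f"decompression: {len(out) / elapsed / 1e6:.0f} MB/sec on one core")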
    • EXAMPLE: LOGDRILL • Raw log lines:
      2011-01-08 00:00:01 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 22957 562
      2011-01-08 00:00:09 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 2957 321
      2011-01-08 00:01:04 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 43422 522
      2011-01-08 00:01:08 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 234 425
      2011-01-08 00:02:23 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 404 0 0 234 432
      2011-01-08 00:02:45 X1 Y1 1.2.3.4 POST /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 4353 134
      Aggregated per minute, method and status code:
      2011-01-08 00:00 GET 200 2
      2011-01-08 00:01 GET 200 2
      2011-01-08 00:02 GET 404 1
      2011-01-08 00:02 POST 200 1
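    A sketch of this roll-up in Python, assuming the space-separated field layout of the sample lines above (method in field 6, status code in field 12) and a hypothetical input file name:

      from collections import Counter

      counts = Counter()
      with open("access.log") as f:  # hypothetical input file
          for line in f:
              fields = line.split()
              minute = fields[0] + " " + fields[1][:5]  # '2011-01-08 00:00'
              method, status = fields[5], fields[11]
              counts[(minute, method, status)] += 1

      for (minute, method, status), n in sorted(counts.items()):
          print(minute, method, status, n)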
    • CACHE-AWARE PROGRAMMING • CPU speed increases by about 60% a year • Memory speed increases by only 10% a year • The growing gap is bridged with multi-level cache memories • Exploiting the cache well can yield a 100x speed-up!
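    One way to see the effect, sketched with numpy (exact numbers depend on the CPU and its cache sizes): summing the same number of elements laid out contiguously versus spread out with a large stride, where every element costs a fresh cache line.

      import time
      import numpy as np

      a = np.random.rand(8192, 8192)  # ~512 MB, row-major layout
      strided = a[:, ::16]            # view: one element per 128 bytes
      contiguous = strided.copy()     # same values, packed densely

      for name, arr in [("contiguous", contiguous), ("strided", strided)]:
          start = time.perf_counter()
          arr.sum()                   # identical arithmetic, different layout
          print(name, time.perf_counter() - start)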
    • LESSONS LEARNED • Big Data is not just hype, at least from the technological viewpoint • Modern technologies (Impala, Spark) can reach the theoretical limits of the cluster hardware • A deep understanding of both the problem and the technologies is required to create efficient Big Data solutions
    • THANK YOU! Q&A?