Gdynia TECH Group
What is cool?
big data
distributed systems
libs (algorithms, collections, network, multithreading, serialization, ...)
patterns, methodologies, best practices
trends
technical presentations
hackathons
workshops
conferences/local events
What do we want to do?
trainings
Upcoming presentations...
Distributed caching with Hazelcast
Storm - real time stream processing
TDD - myth or good practice.
Handling failures in distributed systems
Serialization for everybody
Test your code. Always.
SQL Server Reporting Services - make your users happy and your
life easier
Upcoming presentations...
Reading (un)real-time feeds in Event Platform
Distributed computing and clustering done right
ActiveMQ usage in SEM's Live Transcript process.
33 things we did wrong. EP lessons learned.
Who does it better? GitFlow implemented in EP and SEM.
Why Kafka is a standard?
Want to contribute? contact us
Q?
Introduction to the Hadoop Ecosystem
What is NoSQL?
A NoSQL (often interpreted as "Not only SQL"[1][2]) database provides a
mechanism for storing and retrieving data that is modeled by means other
than the tabular relations used in relational databases
What is Big Data?
10TB
Hadoop is Big Data!?
What is Hadoop?
Google released the Google File System paper in October 2003
Google released the MapReduce paper in December 2004
In 2006, Cutting went to work with Yahoo, which was
equally impressed by the Google File System and
MapReduce papers and wanted to build open source
technologies based on them
The transformation into Hadoop being “behind every click”
(or every batch process, technically) at Yahoo was pretty
much complete by 2008
By the time Yahoo spun out Hortonworks into a separate,
Hadoop-focused software company in 2011, Yahoo’s
Hadoop infrastructure consisted of 42,000 nodes and
hundreds of petabytes of storage
What is Hadoop?
Hadoop
Hadoop
HDFS
Map Reduce
Map Reduce
YARN
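The MapReduce model named above can be sketched in a few lines of plain Python. This is only an illustration of the map, shuffle, and reduce steps on a word count, not Hadoop's API; the function names and the sample input are assumptions made for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three steps in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "big clusters"]))
```

In a real cluster the map and reduce calls run on different nodes and the shuffle moves data over the network; here everything runs in one process to keep the shape of the computation visible.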
Other YARN applications
Storm
Spark
Tez
Samza
Impala
Hive
Hive is a data warehousing infrastructure based on
Hadoop. Hadoop provides massive scale out and fault
tolerance capabilities for data storage and processing
Example
CREATE TABLE page_view(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '1'
STORED AS SEQUENCEFILE;
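PARTITIONED BY matters because Hive stores each partition in its own subdirectory of the table directory, named column=value. The Python sketch below only builds such a path to show the layout. The helper name is hypothetical, the warehouse root is an assumed default-style location, and escaping of special characters in values is ignored.

```python
import posixpath


def partition_path(warehouse_dir, table, partition_values):
    """Build <warehouse>/<table>/<col>=<value>/... for one partition.

    partition_values is an ordered list of (column, value) pairs,
    in the same order as the PARTITIONED BY clause.
    """
    parts = [f"{col}={val}" for col, val in partition_values]
    return posixpath.join(warehouse_dir, table, *parts)


if __name__ == "__main__":
    # Assumed warehouse root; a real cluster may configure a different one.
    print(partition_path("/user/hive/warehouse", "page_view",
                         [("dt", "2008-03-03"), ("country", "US")]))
```

A query that filters on dt and country can then read just the matching directories instead of scanning the whole table.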
Example
SELECT pv.*, u.gender, u.age, f.friends
FROM page_view pv JOIN user u ON (pv.userid = u.id) JOIN
friend_list f ON (u.id = f.uid)
WHERE pv.dt = '2008-03-03';
Example
SELECT pv_users.gender, count(DISTINCT pv_users.userid),
count(*), sum(DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;
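To make clear what the GROUP BY query above computes, here is a rough in-memory equivalent in Python. It is a sketch only: the function name and the sample rows are assumptions, and SQL NULL handling is left out.

```python
from collections import defaultdict


def aggregate_by_gender(rows):
    """rows: iterable of (gender, userid) pairs.

    Returns {gender: (distinct_users, row_count, sum_of_distinct_userids)},
    mirroring count(DISTINCT userid), count(*), sum(DISTINCT userid).
    """
    users = defaultdict(set)   # distinct userids seen per gender
    counts = defaultdict(int)  # total rows per gender
    for gender, userid in rows:
        users[gender].add(userid)
        counts[gender] += 1
    return {g: (len(users[g]), counts[g], sum(users[g])) for g in counts}


if __name__ == "__main__":
    sample = [("F", 1), ("F", 1), ("F", 2), ("M", 3)]
    print(aggregate_by_gender(sample))
```

Hive compiles the same logic into distributed jobs, so the per-gender sets and counters are built across many nodes rather than in one dictionary.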
Pig
Pig is a high level scripting language that is used with
Apache Hadoop. Pig excels at describing data analysis
problems as data flows. Pig is complete in that you can do
all the required data manipulations in Apache Hadoop with
Pig
Example
players = load 'baseball' as (name:chararray, team:chararray,
position:bag{t:(p:chararray)}, bat:map[]);
noempty = foreach players generate name,
((position is null or IsEmpty(position)) ? {('unknown')} :
position) as position;
pos = foreach noempty generate name, flatten(position) as position;
bypos = group pos by position;
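The data flow in the Pig script can be traced step by step in plain Python. This is a sketch of the same three relations (noempty, pos, bypos) on in-memory lists, not how Pig executes them; the function name and sample players are assumptions, and the team and bat fields are dropped for brevity.

```python
from collections import defaultdict


def group_players_by_position(players):
    """players: list of (name, positions), where positions is a list or None."""
    # noempty: replace a missing or empty bag with a one-element bag 'unknown'
    noempty = [(name, positions if positions else ["unknown"])
               for name, positions in players]
    # pos: flatten the bag, producing one (name, position) tuple per position
    pos = [(name, p) for name, positions in noempty for p in positions]
    # bypos: group the flattened tuples by position
    bypos = defaultdict(list)
    for name, p in pos:
        bypos[p].append((name, p))
    return dict(bypos)


if __name__ == "__main__":
    sample = [("Ann", ["pitcher", "catcher"]), ("Bob", []), ("Cy", None)]
    for position, members in group_players_by_position(sample).items():
        print(position, members)
```

Each intermediate list corresponds to one named relation in the script, which is the sense in which Pig describes analysis as a data flow.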
Other frameworks...
Apache Spark
Impala
Apache Tez
Apache Flink
Storm, Samza, Spark Streaming, Flink Streaming (real-time analytics)
HBase
When Would I Use Apache HBase?
Use Apache HBase™ when you need random, realtime read/write access to your
Big Data. This project's goal is the hosting of very large tables -- billions of rows X
millions of columns -- atop clusters of commodity hardware
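The "random read/write on a very large, sparse table" idea can be illustrated with a toy in-memory structure in Python. This is not the HBase client API; the class and method names are hypothetical. It only mirrors two properties of the data model: rows are kept sorted by row key, and each row holds only the columns actually written, addressed as family:qualifier.

```python
import bisect


class ToySparseTable:
    """Toy model of a sorted, sparse, column-oriented table (illustration only)."""

    def __init__(self):
        self._keys = []  # row keys, kept in sorted order
        self._rows = {}  # row key -> {"family:qualifier": value}

    def put(self, row_key, column, value):
        """Write one cell; rows store only the columns that were written."""
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key, column):
        """Random read of one cell; returns None if the cell was never written."""
        return self._rows.get(row_key, {}).get(column)

    def scan(self, start, stop):
        """Return rows with start <= row key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]


if __name__ == "__main__":
    table = ToySparseTable()
    table.put("user2", "info:name", "Bob")
    table.put("user1", "info:name", "Ann")
    table.put("user1", "stats:visits", 3)
    print(table.get("user1", "stats:visits"))
    print(table.scan("user1", "user2"))
```

The real system adds timestamps per cell, persistence on HDFS, and splitting of the key range across servers, none of which is modeled here.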
Q?


Editor's Notes

  • #2 To start, I'll wear you out a little… we'll answer a few questions together… I know, if you had known there would be questions, you wouldn't have come…, which is why I'm only telling you now
  • #3 Who does cool things?
  • #5 Show ourselves outside the company. Do you think there is nothing interesting to show? Well, sure, when I hear that tests make no sense below 10k lines of code
  • #6 If not, there are two possibilities: either you are wrong, or something is generally not right
  • #8 This can have various causes: a lack of knowledge sharing. Everyone sits in their own sandbox digging a hole with a toy shovel, while the room next door has an excavator
  • #11 1. You are our future speakers… :) 2. There is a lot to gain: -respect -presentation skills -preparing a presentation can be very educational -building your personal brand -a place for people who would like to present externally but have nowhere to try - We provide support: -help with preparing the presentation -choosing a topic - you want to show 'something' but have no topic and don't know what might interest other people? We will find you a topic
  • #37 HDFS