Intro to Hadoop ecosystem

What is cool?
big data
distributed systems
libs (algorithms, collections, network, multithreading, serialization, ...)
patterns, methodologies, best practices
trends

What do we want to do?
technical presentations
hackathons
workshops
conferences/local events
trainings

Upcoming presentations...
Distributed caching with Hazelcast
Storm - real time stream processing
TDD - myth or good practice.
Handling failures in distributed systems
Serialization for everybody
Test your code. Always.
SQL Server Reporting Services - make your users happy and your life easier

Upcoming presentations...
Reading (un)real-time feeds in Event Platform
Distributed computing and clustering done right
ActiveMQ usage in a SEM's Live Transcript process.
33 things we did wrong. EP lessons learned.
Who does it better? GitFlow implemented in EP and SEM.
Why is Kafka a standard?

Want to contribute? Contact us.

Introduction to Hadoop Ecosystem

A NoSQL (often interpreted as "Not only SQL") database provides a
mechanism for storage and retrieval of data that is modeled by means other
than the tabular relations used in relational databases.

Google released the Google File System paper in October 2003

Google released the MapReduce paper in December 2004

In 2006, Doug Cutting, co-creator of Hadoop, went to work at Yahoo,
which was equally impressed by the Google File System and
MapReduce papers and wanted to build open source
technologies based on them

The transformation into Hadoop being “behind every click”
(or every batch process, technically) at Yahoo was pretty
much complete by 2008

By the time Yahoo spun out Hortonworks into a separate,
Hadoop-focused software company in 2011, Yahoo’s
Hadoop infrastructure consisted of 42,000 nodes and
hundreds of petabytes of storage

Other YARN applications
Storm
Spark
Tez
Samza
Impala

Hive is a data warehousing infrastructure based on
Hadoop. Hadoop provides massive scale out and fault
tolerance capabilities for data storage and processing

Example
CREATE TABLE page_view(viewTime INT, userid BIGINT,
    page_url STRING, referrer_url STRING,
    ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\001'
STORED AS SEQUENCEFILE;

Example
SELECT pv.*, u.gender, u.age, f.friends
FROM page_view pv
    JOIN user u ON (pv.userid = u.id)
    JOIN friend_list f ON (u.id = f.uid)
WHERE pv.dt = '2008-03-03';

Example
SELECT pv_users.gender, count(DISTINCT pv_users.userid),
count(*), sum(DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;
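To make the three aggregates concrete, here is a minimal Python sketch of what the query above computes for each gender. It is an illustration only, not how Hive executes the query: the function name aggregate_by_gender and the in-memory sample rows are hypothetical stand-ins for the pv_users table.

```python
from collections import defaultdict


def aggregate_by_gender(rows):
    """Mimic the GROUP BY query on (gender, userid) pairs.

    `rows` is a hypothetical in-memory stand-in for the pv_users table.
    Returns {gender: (count_distinct_userid, count_all, sum_distinct_userid)}.
    """
    groups = defaultdict(list)
    for gender, userid in rows:
        groups[gender].append(userid)

    result = {}
    for gender, userids in groups.items():
        distinct = set(userids)  # DISTINCT removes repeated user ids
        result[gender] = (len(distinct), len(userids), sum(distinct))
    return result


# Hypothetical sample data: user 1 viewed two pages, users 2 and 3 one each.
sample = [("F", 1), ("F", 1), ("F", 2), ("M", 3)]
summary = aggregate_by_gender(sample)
```

Note that count(DISTINCT userid) counts users, while count(*) counts page views, so the two differ whenever a user has more than one row.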

Pig is a high-level scripting language that is used with
Apache Hadoop. Pig excels at describing data analysis
problems as data flows. Pig is complete in that you can do
all the required data manipulations in Apache Hadoop with
Pig.

Example
players = load 'baseball' as (name:chararray, team:chararray,
    position:bag{t:(p:chararray)}, bat:map[]);
noempty = foreach players generate name,
    ((position is null or IsEmpty(position)) ? {('unknown')} : position)
    as position;
pos = foreach noempty generate name, flatten(position) as position;
bypos = group pos by position;
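The data flow in the Pig script can be followed step by step in plain Python. The sketch below mirrors the noempty, pos and bypos relations on a small in-memory list. The function name and the sample players are hypothetical, and the code only imitates the semantics; it does not show how Pig runs on Hadoop.

```python
from collections import defaultdict


def group_players_by_position(players):
    """Imitate the Pig flow: default empty bags, flatten, then group.

    `players` is a list of dicts with a 'name' and a 'position' entry,
    where 'position' is a list of strings, an empty list, or None.
    """
    # noempty: replace a null or empty position bag with {('unknown')}
    noempty = [(p["name"], p.get("position") or ["unknown"]) for p in players]

    # pos: flatten(position) emits one (name, position) record per bag element
    pos = [(name, position) for name, positions in noempty for position in positions]

    # bypos: group the flattened records by position
    bypos = defaultdict(list)
    for name, position in pos:
        bypos[position].append((name, position))
    return dict(bypos)


# Hypothetical sample input
players = [
    {"name": "A", "position": ["Catcher", "First_base"]},
    {"name": "B", "position": []},
    {"name": "C", "position": None},
]
bypos = group_players_by_position(players)
```

A player with two positions appears in two groups, which is exactly what flatten is for.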

Other frameworks...
Apache Spark
Impala
Apache Tez
Apache Flink
Storm, Samza, Spark Streaming, Flink Streaming (real-time analytics)

When Would I Use Apache HBase?
Use Apache HBase™ when you need random, realtime read/write access to your
Big Data. This project's goal is the hosting of very large tables -- billions of rows X
millions of columns -- atop clusters of commodity hardware
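One way to picture "random, realtime read/write access" on a very wide, sparse table is a map from row key to a map of columns, with rows scanned in key order. The Python sketch below is a toy in-memory model under that assumption. The class SparseTable and its methods are hypothetical and are not the HBase client API; a real cluster adds column families, versions, persistence and distribution across region servers.

```python
class SparseTable:
    """Toy model of a sparse, row-key-addressed table (not the HBase API)."""

    def __init__(self):
        self._rows = {}  # row key -> {column name -> value}

    def put(self, row_key, column, value):
        # Random write: only the cells that are set take up space.
        self._rows.setdefault(row_key, {})[column] = value

    def get(self, row_key, column=None):
        # Random read by row key, optionally narrowed to a single column.
        row = self._rows.get(row_key, {})
        return dict(row) if column is None else row.get(column)

    def scan(self, start_key, stop_key):
        # Range scan in row-key order: start inclusive, stop exclusive.
        for key in sorted(self._rows):
            if start_key <= key < stop_key:
                yield key, dict(self._rows[key])


table = SparseTable()
table.put("user#001", "info:name", "Ann")
table.put("user#002", "info:name", "Bob")
table.put("user#002", "stats:clicks", 17)
rows = list(table.scan("user#001", "user#003"))
```

Because rows need not share columns, a table can have millions of possible columns while each row stores only a few.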
