2. Let's start with an example…
access.log
timestamp,url,response_code,response_time
products.dat
date,product_id,price
3. Requirement:
Number of requests in the last 30 days.
$> ls -rt *.log | tail -30 | xargs wc -l
4. Requirement:
Busiest 30 minutes in last 30 days.
$> ls -rt *.log | tail -30 | xargs ./count_30min.sh
5. Requirement:
Number of failed buy requests for products
worth more than $30 in the last 30 days.
Import data to an RDBMS:
SELECT COUNT(*) FROM logs, products
WHERE GET_REQUEST_TYPE(logs.url)='BUY'
AND GET_PRODUCT_ID(logs.url)=products.product_id
AND products.price>30
AND DATE(logs.timestamp)=products.date;
6. Now gimme the number of failed buy requests
for products worth more than $30 in the last 1 year.
It might take a while!
8. 5 days later…
$> mysqladmin processlist
+-------+--------------------+
| Query | Copy to Temp Table |
+-------+--------------------+
Maybe it's joining!!
Or maybe it's dead…
Or maybe my replacement will see the result…
10. But why?!
A distributed data processing cluster gives you:
A distributed file system
A data-location-aware task scheduler
The MapReduce paradigm
Handling of parallelizable and distributable tasks
Failover capability
Web-based monitoring capabilities
More value for your time!
11. Where to start??
MapReduce:
Q: Sum of all squares of a=[1,2,3,4,3,2,7]
Simple….
fold( map(a, square()), sum())
You can do that in any functional programming language….
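The fold/map pattern above can be sketched in a few lines of Python, where `functools.reduce` plays the role of fold:

```python
from functools import reduce

a = [1, 2, 3, 4, 3, 2, 7]

# map: square each element independently (the parallelizable part)
squares = map(lambda x: x * x, a)

# fold: combine the squared values with addition
total = reduce(lambda acc, x: acc + x, squares, 0)

print(total)  # 1 + 4 + 9 + 16 + 9 + 4 + 49 = 92
```

The point is that the map step has no cross-element dependencies, which is exactly what lets a framework split it across machines.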
12. Now do it for an array of 100 million elements…
13. This is where Hadoop comes in..
Distributed File System (HDFS): Name Node + Data Node
Distributed Computation/Task Processing (MapReduce): Job Tracker + Task Tracker
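The split/map/shuffle/reduce flow those daemons coordinate can be sketched as a toy in-memory simulation (this is the shape of the paradigm, not the Hadoop API; the partitioned inputs stand in for HDFS blocks):

```python
from collections import defaultdict
from itertools import chain

def map_phase(partition):
    # mapper: emit a (word, 1) pair for every word in this node's partition
    for line in partition:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # group intermediate values by key, as the framework does between phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reducer: sum the counts for each word
    return {key: sum(values) for key, values in groups.items()}

# two "data nodes", each holding a partition of the input
partitions = [["hello world", "hello hadoop"], ["world of hadoop"]]
mapped = chain.from_iterable(map_phase(p) for p in partitions)
counts = reduce_phase(shuffle(mapped))
print(counts)  # {'hello': 2, 'world': 2, 'hadoop': 2, 'of': 1}
```

Each partition is mapped independently, so the map phase scales out; only the shuffle requires moving data between nodes.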
17. An example
Word Count
hadoop jar contrib/streaming/hadoop-0.*-streaming.jar \
  -jobconf mapred.data.field.separator="," \
  -input 'wc.eg.in' \
  -output 'wc.eg.out' \
  -mapper 'wc -w' \
  -reducer "awk '{ sum += \$1 } END { print sum }'"
Job Tracker: http://cae5.internal.directi.com:50030/jobtracker.jsp
Name Node: http://cae2.internal.directi.com:50070/dfshealth.jsp
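What the streaming job does per split can be mimicked in plain Python (the hypothetical in-memory splits stand in for HDFS blocks): each mapper counts the words in its split, like `wc -w`, and the reducer sums those counts, like the awk script:

```python
# two input splits, as the streaming framework would hand them to mappers
splits = ["the quick brown fox", "jumps over the lazy dog"]

# mapper ('wc -w'): emit the word count of each split
mapper_output = [len(split.split()) for split in splits]

# reducer (awk '{ sum += $1 } END { print sum }'): sum the per-split counts
total_words = sum(mapper_output)
print(total_words)  # 4 + 5 = 9
```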
18. Pig and Pig Latin
A procedural data-flow language for expressing MapReduce operations.
logs = LOAD 'access.logs' USING PigStorage(',') AS
(ts:int, URL:chararray, resp:chararray, resp_time:int);
products = LOAD 'products.dat' USING PigStorage(',') AS
(date:int, pid:int, price:int);
l1 = FOREACH logs GENERATE GetDate(ts) as req_date,
GetProductID(URL) as prod_id,
GetRequestType(URL) as rtype, resp, resp_time;
j1 = JOIN l1 BY prod_id, products BY pid;
j2 = FILTER j1 BY req_date==date AND price>30.0F;
j3 = GROUP j2 ALL;
j4 = FOREACH j3 GENERATE COUNT(j2);
DUMP j4;
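The same pipeline can be traced in plain Python over a toy in-memory sample. The stand-in parsers below play the role of the GetDate/GetProductID/GetRequestType UDFs, and the URL format and field values are made up for illustration, not taken from the talk:

```python
# toy stand-ins for the Pig UDFs; the "/<type>/<pid>" URL shape is assumed
def get_date(ts):          return ts // 86400           # day number from epoch seconds
def get_product_id(url):   return int(url.split("/")[-1])
def get_request_type(url): return url.split("/")[1].upper()

# sample rows mirroring the LOAD schemas: (ts, url, resp, resp_time)
logs = [(86400 * 10 + 5, "/buy/42",  "500", 120),
        (86400 * 10 + 9, "/buy/7",   "200",  80),
        (86400 * 11 + 1, "/view/42", "200",  30)]
# (date, pid, price)
products = [(10, 42, 35), (10, 7, 12)]

# FOREACH logs GENERATE: project the derived fields
l1 = [(get_date(ts), get_product_id(url), get_request_type(url), resp)
      for ts, url, resp, _ in logs]

# JOIN l1 BY prod_id, products BY pid
j1 = [(req_date, prod_id, rtype, resp, date, price)
      for req_date, prod_id, rtype, resp in l1
      for date, pid, price in products if prod_id == pid]

# FILTER j1 BY req_date == date AND price > 30
j2 = [row for row in j1 if row[0] == row[4] and row[5] > 30]

# GROUP ALL + COUNT
print(len(j2))  # 1 matching request
```

On a cluster, Pig compiles each of these relational steps into MapReduce jobs, so the same script scales from this toy sample to the full year of logs.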