Distributed Data Processing Workshop - SBU

Amir Sedighi

This presentation is about how to prepare a distributed data processing environment on your PC.

Data & Analytics

Distributed Data Processing Workshop
Shahid Beheshti Campus
Faculty of Computer Science and Engineering
Course: Distributed Databases
Instructor: Dr. Hadi Tabatabaei
Presenter: Abolfazl Sedighi
Aban 1393 (October/November 2014)

Distributed Data Processing
School of Computer Science and Engineering
A. Sedighi
@amirsedighi
Hexican.com
sedighi@gmail.com

Every Game Needs Its Playing Yard

What can I do on a Single Machine?
● MVC Programming
● Regular Biz Apps
● 100 GB of Data
● Web Surfing
● ...

Linux Cluster

Introduction
This is a four-session, hands-on, step-by-step
tutorial on setting up a Linux cluster on your
own machine (notebook or PC) to try out a few
big-data processing frameworks and tools.

What are we going to do?
● Your notebook or PC is enough to get started.
– Setting up your Linux cluster.
● Distributed log management and real-time search engines
– What is Elasticsearch?
– Elasticsearch on the cluster.
– Monitoring and usage.
● The most popular distributed data processing framework
– What is Apache Hadoop?
– Apache Hadoop on the cluster.
– Usage scenarios.

What will we learn?
● Improving our knowledge of big data.
● Getting familiar with distributed data processing.
● Maximizing availability and reliability.
● Increasing data storage capacity.
● Improving data processing performance.
● Data locality is a silver bullet.
● Increasing cluster utilization.
● Taming giants by giving them a try.

Preparing the Linux Cluster - VirtualBox

Preparing the Cluster - Hosting
● VirtualBox
– Memory size, disk capacity and CPU cores.
– Network interfaces.
● NAT provides Internet access.
● Host-only provides communication between the cluster nodes.
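
The same settings can also be applied from the command line with VBoxManage instead of the VirtualBox GUI. The following is only a sketch under assumptions that are not in the slides: the VM is called "node1", the host-only network is "vboxnet0", and the guest gets 2048 MB of RAM and 2 CPU cores. By default it only prints the commands; set DRY_RUN=0 to execute them while the VM is powered off.

```shell
#!/bin/sh
# Sketch: give a VirtualBox VM a NAT adapter and a host-only adapter.
# Assumed values (hypothetical): VM "node1", network "vboxnet0",
# 2048 MB RAM, 2 CPU cores.

DRY_RUN="${DRY_RUN:-1}"   # 1 = print the commands only, 0 = execute them

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

configure_node() {
  vm="$1"
  hostonly_net="${2:-vboxnet0}"
  # Memory size (MB) and number of CPU cores for the guest.
  run VBoxManage modifyvm "$vm" --memory 2048 --cpus 2
  # Adapter 1: NAT, gives the guest Internet access.
  run VBoxManage modifyvm "$vm" --nic1 nat
  # Adapter 2: host-only, carries the traffic between cluster nodes.
  run VBoxManage modifyvm "$vm" --nic2 hostonly --hostonlyadapter2 "$hostonly_net"
}

configure_node node1
```

If the host-only network does not exist yet, `VBoxManage hostonlyif create` adds one.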

Preparing the Cluster – Adding a Host-Only Network

Preparing the Cluster – Adding a NAT Interface

Preparing the Cluster – Adding a Host-Only Interface

Preparing the Cluster – First Node
● Create a Linux machine inside VirtualBox.
● Install Linux. (I used Ubuntu 12.04.)
– Select Samba.
– Select OpenSSH.
● Put everything the other nodes will need on the first node.
– An “install” folder.
– Prerequisites such as Java.
● Shut down the first node.
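
If Samba and OpenSSH were not selected during installation, the first node can be prepared afterwards from a terminal. This is a rough sketch, not the author's exact procedure: the package names (openssh-server, samba, openjdk-7-jdk) and the location of the "install" folder in the home directory are assumptions. It prints the commands by default; set DRY_RUN=0 to run them.

```shell
#!/bin/sh
# Sketch: install the prerequisites on the first node before cloning it.
# Assumed (hypothetical) choices: Ubuntu package names and ~/install.

DRY_RUN="${DRY_RUN:-1}"   # 1 = print the commands only, 0 = execute them

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

prepare_first_node() {
  run sudo apt-get update
  # SSH server, Samba and a Java runtime, so every clone already has them.
  run sudo apt-get install -y openssh-server samba openjdk-7-jdk
  # Folder that holds the downloaded software for later sessions.
  run mkdir -p "$HOME/install"
  # Power the node off so that it can be cloned.
  run sudo shutdown -h now
}

prepare_first_node
```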

Preparing the Cluster – Cloning, the VirtualBox Side
● Clone the first node. (tutorial)
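
Cloning can also be scripted with VBoxManage. A minimal sketch, assuming the first node is registered as "node1" and the clone should be called "node2" (both names are hypothetical); the source VM has to be shut down first. Commands are only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: clone a powered-off VirtualBox VM and boot the clone.
# Assumed (hypothetical) VM names: node1 (source), node2 (clone).

DRY_RUN="${DRY_RUN:-1}"   # 1 = print the commands only, 0 = execute them

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

clone_node() {
  src="$1"
  new="$2"
  # Full clone, registered with VirtualBox under the new name.
  run VBoxManage clonevm "$src" --name "$new" --register
  # Start the clone without opening a GUI window.
  run VBoxManage startvm "$new" --type headless
}

clone_node node1 node2
```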

Preparing the Cluster – Cloning, the Linux Side
● Turn the new node on.
● Network configuration
– sudo nano /etc/hosts
– sudo nano /etc/hostname
– sudo nano /etc/network/interfaces
– sudo rm /etc/udev/rules.d/70-persistent-net.rules
● sudo reboot
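
The slides do not show what goes into the edited files, so the following sketch only illustrates one plausible layout. Everything in it is an assumption: node names node1..nodeN, addresses 192.168.56.101 and up (the usual VirtualBox host-only range), eth0 as the NAT adapter and eth1 as the host-only adapter. The functions just print text; paste the output into the files by hand. Removing 70-persistent-net.rules is likely needed because a clone typically gets new MAC addresses, and the old rules would otherwise keep the interface names bound to the original ones.

```shell
#!/bin/sh
# Sketch: print example contents for /etc/hosts and /etc/network/interfaces.
# All names and addresses below are hypothetical.

# One "address name" line per node, to be appended to /etc/hosts on every node.
hosts_entries() {
  count="$1"
  i=1
  while [ "$i" -le "$count" ]; do
    echo "192.168.56.$((100 + i)) node$i"
    i=$((i + 1))
  done
}

# Interface stanzas for node number $1; keep the existing loopback stanza.
interfaces_stanza() {
  index="$1"
  cat <<EOF
auto eth0
iface eth0 inet dhcp

auto eth1
iface eth1 inet static
    address 192.168.56.$((100 + index))
    netmask 255.255.255.0
EOF
}

echo "# lines for /etc/hosts (3-node cluster)"
hosts_entries 3
echo "# /etc/hostname on the second node"
echo "node2"
echo "# stanzas for /etc/network/interfaces on the second node"
interfaces_stanza 2
```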

Preparing the Cluster – Passwordless Login
● Do this:
– ssh-keygen
– ssh-copy-id -i ~/.ssh/id_rsa.pub user@host
● Or this:
– ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa
– scp ~/.ssh/id_rsa.pub user@host:~/master_key
– ssh user@host
– cat ~/master_key >> ~/.ssh/authorized_keys
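
With more than one node, the key has to be copied to each of them. A small sketch of that loop, assuming hypothetical host names node1..node3 and a login name taken from CLUSTER_USER (default "user", as in the commands above). The second command uses BatchMode=yes, which disables password prompts, so it fails if key-based login is not working yet. Commands are only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: copy the public key to every node and verify passwordless login.
# Host names and the login name are hypothetical.

DRY_RUN="${DRY_RUN:-1}"            # 1 = print the commands only, 0 = execute them
CLUSTER_USER="${CLUSTER_USER:-user}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

push_key() {
  for host in "$@"; do
    # Appends the public key to ~/.ssh/authorized_keys on the remote node.
    run ssh-copy-id -i "$HOME/.ssh/id_rsa.pub" "$CLUSTER_USER@$host"
    # Must succeed without asking for a password.
    run ssh -o BatchMode=yes "$CLUSTER_USER@$host" hostname
  done
}

push_key node1 node2 node3
```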

Preparing the Cluster – Distributed Shell
● Do it like a commander
– Installing DSH (optional)
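
As a rough illustration of what DSH offers once passwordless login works: it runs one command on every machine listed in a machines file. The sketch below assumes the Ubuntu package is called dsh, the node names are node1..node3, and the list is kept in ~/.dsh/machines.list; none of this is taken from the slides. Commands are only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: install dsh and run a command on all cluster nodes.
# Package name, node names and the list location are assumptions.

DRY_RUN="${DRY_RUN:-1}"   # 1 = print the commands only, 0 = execute them

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Prints one host name per line, the format dsh expects in machines.list.
machines_list() {
  for host in "$@"; do
    echo "$host"
  done
}

run_everywhere() {
  # -a: all machines in the list, -M: prefix output with the machine name,
  # -c: run concurrently, -r ssh: use ssh as the remote shell.
  run dsh -a -M -c -r ssh -- "$@"
}

run sudo apt-get install -y dsh
# To create the list for real:
#   mkdir -p ~/.dsh && machines_list node1 node2 node3 > ~/.dsh/machines.list
machines_list node1 node2 node3
run_everywhere uptime
```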

Preparing the Cluster – Enjoy It
● To scale your cluster, just repeat the cloning
step.

Next?
● An introduction to distributed log management
and analytical search engines.
– How does Elasticsearch work?
– Workshop.
● An introduction to Apache Hadoop
– How does Apache Hadoop work?
– Workshop.
