Slides from the January 2021 St. Louis Big Data IDEA meeting by Tim Bytnar regarding using Docker containers for a localized Hadoop development cluster.
Localized Hadoop Development
1. Localized Hadoop Development
How to get up and running quickly by Tim Bytnar
This Photo by Unknown Author is licensed under CC BY-SA
2. Tim Bytnar
17 years in the industry
Data Engineering
Microsoft Development and Application Stack
Systems Automation
Datacenter Infrastructure
Network Engineering
Email: Tim.Bytnar@Daugherty.com
LinkedIn: https://www.linkedin.com/in/timbytnar/
I have not failed. I've just found 10,000 ways that won't work.
- Thomas A. Edison
3. What is the problem?
Hadoop development has a steep prerequisite: access to an environment that allows you to freely explore the overwhelming ecosystem.
4. Are there other options?
CLOUD PROVIDER “FREE” TIME
BOOK LEARNING OR VIDEO TRAINING
HOME LAB (IF YOU HAVE ONE OF THESE LYING AROUND LIKE I DON’T)
7. What the environment is for.
• Learning Hadoop!
• Developing…
• BASH Scripts
• Hive Automations
• Spark Processing
• Data Analysis (Tableau, PowerBI, Jupyter, etc…)
• Rapid Proof of Concept
• Will this dataset work in Hadoop?
• What advantages would Spark give me for this workload?
11. Want any help?
The repository is public and open for pull requests or forks
Future Plans
• Keep it updated
• Add more modularity
• Add walkthroughs and challenges
• Improve Cross-platform Portability
• Baseline Performance Optimized Version
13. Tim Bytnar
Email: Tim.Bytnar@Daugherty.com
LinkedIn: https://www.linkedin.com/in/timbytnar/
> git clone https://github.com/tbytnar/docker-hive.git
Thank you to:
Ivan Ermilov and his team at Big Data Europe
http://github.com/big-data-europe/docker-hadoop
http://github.com/big-data-europe/docker-hive
Editor's Notes
Thank you for attending today and thank you for giving me your time.
Tonight, I’ll be talking a bit about training and developing in Hadoop and particularly the challenges of doing so.
First that awkward narcissistic slide where I tell you a little about myself.
Like many of you I grew up lovingly addicted to technology, especially computers. Seventeen years ago I finally turned that passion into a career and over that time I’ve gotten my hands into many different verticals.
Much of that time has been spent working with data either as a DBA or as an engineer. Paired with that has been a lot of time in the Microsoft stack either developing and supporting software applications or deploying and managing server infrastructure.
As most of my career has been spent in managed hosting, I’ve also had quite a bit of experience working with systems automation, monitoring, infrastructure design and implementation, and a little dabbling in network engineering.
I’ve put my favorite quote there by Thomas Edison. [READ THE QUOTE] You’ll find out why I like that quote so much in a bit.
So, what IS the problem exactly?
Well, I should probably start with my story. I got interested in Big Data several years ago when the term became mainstream. I did my typical Google-fu to see what I could learn about the technology and maybe convince my managers to look at implementing it. No dice. It felt like the more I dug, the more questions I had. Hadoop, HDFS, YARN, Pig, Sqoop, MapReduce, Spark, Hive, Solr, Lucene, ZooKeeper, Oozie… and I’ve only scratched the surface of the entire ecosystem. By the time I got INTO big data and Hadoop, it was already overwhelming.
Alright, fine, I’ll knuckle down and set up a private environment for myself so I can start learning this behemoth. At the time, most of the guides I followed directed me to the cloud providers… which I followed… and a several-hundred-dollar bill, after I forgot I’d left a cluster online for a month, put a big price tag on this lesson. And the effect of that? Well, I shied away, opting instead to try to learn Hadoop in other people’s environments… which of course took a lot more time.
So Hadoop has a steep learning requirement that is … having an environment to learn with in the first place.
“Well, but Tim, there must be other options out there,” you’re probably saying right now. “What about Cloudera’s QuickStart VM?” you’re asking. Well, Cloudera has ended the QuickStart environment in favor of pushing a “free” trial of their hosted product. There are other options, and some of them can be pretty effective.
Let’s touch back on the cloud-hosted method. There are a vast number of guides that will take you step by step through spinning up a Hadoop cluster on each of the major cloud providers. I will warn you that a lot of those guides are outdated and will have you scratching your head over older or mismatched versions of components. Also, set yourself a reminder: shut that thing down when you’re done with it; your wallet will thank you later.
As for book learning or video training, I’ve always envied people who were able to sit down, read a training manual cover to cover, and absorb all of that knowledge. Myself? I learn better when I’m getting my hands dirty. Video training à la Pluralsight or Lynda does a pretty good job, but usually only gets you so far before sending you off on your own without a working environment to use.
And of course for those of you who are fortunate enough to have a full Cisco UCS chassis sitting in your basement just waiting for another workload to be thrown at it, more power to you folks. For the rest of us, if you have a spare PC lying around with a fair amount of memory (> 8GB), you can manage to cobble together a home lab and there are plenty of guides out there on how to do that.
So, what am I proposing?
Well, Docker to be quite honest. The portability, flexibility and scalability make this option REALLY attractive.
So attractive that I gave putting an environment together the good old college try. Now… this is where I fall on my sword and recall that Thomas Edison quote from earlier. I… didn’t fail per se… but I certainly found at LEAST 10,000 ways to build a Dockerized Hadoop environment incorrectly.
To that end, in my adventures in this space I’ve stumbled across several repositories that I’ve forked, enhanced and utilized to create my own environment.
What I’ve put together is a Docker Compose file that makes it quick and easy to build and provision a Hadoop cluster with Hive AND a multi-node Spark cluster, all of which is open source and ready to be further enhanced by anyone wanting to contribute.
My goal with this environment is to provide like-minded individuals a way to dip their toes into Hadoop at its core.
It’s barebones Hadoop, Hive and Spark. The idea is straight to the point, get data into the environment, add it to HDFS, create a Hive table for that data and get to work. If you choose to do so, you can leave it at that, or you can spin up the Spark cluster and really get your hands dirty with the data.
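That loop — land a file in the cluster, add it to HDFS, put a Hive table on top of it — might look roughly like this from the host. This is an illustrative sketch only: the container names (`namenode`, `hive-server`), file name, and table schema are assumptions, not taken from the repository; check `docker ps` and the README for the real names.

```shell
# Copy a local CSV into the namenode container, then into HDFS
# (container names are hypothetical -- check `docker ps`)
docker cp sales.csv namenode:/tmp/sales.csv
docker exec namenode hdfs dfs -mkdir -p /data/sales
docker exec namenode hdfs dfs -put /tmp/sales.csv /data/sales/

# Point an external Hive table at that directory via beeline
docker exec hive-server beeline -u jdbc:hive2://localhost:10000 -e "
  CREATE EXTERNAL TABLE IF NOT EXISTS sales (
    id INT, region STRING, amount DOUBLE)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION '/data/sales';
  SELECT COUNT(*) FROM sales;"
```

From there, querying the same table from the Spark side is the natural next step once the Spark containers are up.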
When you execute the docker-compose commands you see here, these are the containers that get provisioned. On the Hadoop side you have a namenode and a single datanode. You get a hive-server, a dedicated hive-metastore container, and a postgres container that houses the Hive metastore database. On the Spark side you get a master and two worker nodes. All of this interconnects using Docker’s bridge networking, which also allows your workstation to connect to these components as if they were running on your machine.
Once you’ve mastered the basics here, you can easily jump in and start adding more components, like Pig or Impala, or maybe Ranger.
We’ve covered why I built the environment, but here are a few reasons why I think it could be helpful for others and why I’m sharing it with you all today.
Obviously the most useful thing about this environment is enabling people to Learn Hadoop. And learn it without all the other distractions that enterprise deployments bring with them. I’m looking at you Cloudera.
Development can take place in this environment and I’m comfortable with saying it will get you at least 90% of the way there. You’ll want to spend that last 10% tweaking your code for performance reasons on whatever environment you’re working in.
And lastly maybe you’re assessing whether or not Hadoop is right for your team. With this environment you can rapidly stand up a proof of concept and decide whether Hadoop is right for your datasets or whether or not Spark would be advantageous to you.
The environment is not, let me repeat that, NOT for production purposes. It’s not optimized for performance at all, and that’s on purpose. I think part of the fun of working in this capacity is troubleshooting all the hair-raising events that would come up in a production environment. So the installation is completely default. Throw your workload on it and tweak the performance to your liking.
I don’t know if I made this clear enough before, but to reiterate, this environment is NOT for production. I didn’t have any security standards or best practices in mind when building this. Again, that’s on purpose. If I were to secure everything the way it should be, no one would want to use it. That said, it’s the perfect environment for learning how to implement security policies, so feel free to go nuts. Worst case scenario, you blow away your containers and spin up new ones ready to be broken again.
Getting started is as simple as cloning the GitHub repository and following the instructions posted in the README.
A few warnings or disclaimers. This hasn’t been thoroughly tested on all platforms, yes it’s Docker and as long as you’re running a recent version of that it SHOULD work fine, but I think we all know there’s a big difference between SHOULD work and WILL work.
Also, in the spirit of open source, I want to make it known that I will be actively maintaining this repository. So feel free to throw PRs my way or fork my work and enhance it for your own uses.
That brings me to the end of my presentation. Thank you all for sitting through my babbling, hopefully you found at least some of it useful.
Again, here is my contact information should you have ANY questions at all or want to help participate in the project.
A HUGE thank you to Ivan Ermilov and his team at Big Data Europe. Their work REALLY saved me on this, and I highly recommend you check out what they’ve done at their repositories.