Slides from the January 2021 St. Louis Big Data IDEA meeting by Tim Bytnar regarding using Docker containers for a localized Hadoop development cluster.
Localized Hadoop Development
1. Localized Hadoop Development
How to get up and running quickly by Tim Bytnar
This Photo by Unknown Author is licensed under CC BY-SA
2. Tim Bytnar
17 years in the industry
Data Engineering
Microsoft Development and Application Stack
Systems Automation
Datacenter Infrastructure
Network Engineering
Email: Tim.Bytnar@Daugherty.com
LinkedIn: https://www.linkedin.com/in/timbytnar/
I have not failed. I've just found 10,000 ways that won't work.
- Thomas A. Edison
3. What is the problem?
Hadoop development has a steep prerequisite: access to an environment that allows you to freely explore the overwhelming ecosystem.
4. Are there other options?
CLOUD PROVIDER “FREE” TIME
BOOK LEARNING OR VIDEO TRAINING
HOME LAB (IF YOU HAVE ONE OF THESE LYING AROUND LIKE I DON’T)
7. What the environment is for.
• Learning Hadoop!
• Developing…
• BASH Scripts
• Hive Automations
• Spark Processing
• Data Analysis (Tableau, PowerBI, Jupyter, etc…)
• Rapid Proof of Concept
• Will this dataset work in Hadoop?
• What advantages would Spark give me for this workload?
11. Want any help?
The repository is public and open for pull requests or forks
Future Plans
• Keep it updated
• Add more modularity
• Add walkthroughs and challenges
• Improve Cross-platform Portability
• Baseline Performance Optimized Version
13. Tim Bytnar
Email: Tim.Bytnar@Daugherty.com
LinkedIn: https://www.linkedin.com/in/timbytnar/
> git clone https://github.com/tbytnar/docker-hive.git
Thank you to:
Ivan Ermilov and his team at Big Data Europe
http://github.com/big-data-europe/docker-hadoop
http://github.com/big-data-europe/docker-hive
Editor's Notes
Thank you for attending today and thank you for giving me your time.
Tonight, I’ll be talking a bit about training and developing in Hadoop and particularly the challenges of doing so.
First that awkward narcissistic slide where I tell you a little about myself.
Like many of you I grew up lovingly addicted to technology, especially computers. Seventeen years ago I finally turned that passion into a career and over that time I’ve gotten my hands into many different verticals.
Much of that time has been spent working with data either as a DBA or as an engineer. Paired with that has been a lot of time in the Microsoft stack either developing and supporting software applications or deploying and managing server infrastructure.
As most of my career has been spent in managed hosting, I’ve also had quite a bit of experience working with systems automation, monitoring, infrastructure design and implementation, and a little dabbling in network engineering.
I’ve put my favorite quote there by Thomas Edison. [READ THE QUOTE] You’ll find out why I like that quote so much in a bit.
So, what IS the problem exactly?
Well, I should probably start with my story. I got interested in Big Data several years ago when the term became mainstream. I did my typical Google-fu to see what I could learn about the technology and maybe convince my managers to look at implementing it. No dice. It felt like the more I dug, the more questions I had. Hadoop, HDFS, YARN, Pig, Sqoop, MapReduce, Spark, Hive, Solr, Lucene, ZooKeeper, Oozie… and I’ve only scratched the surface of the entire ecosystem. By the time I got INTO big data and Hadoop, it was already overwhelming.
Alright, fine, I’ll knuckle down and set up a private environment for myself so I can start learning this behemoth. At the time, most of the guides I followed directed me to the cloud providers… which I followed… and a several-hundred-dollar bill, after I forgot I’d left a cluster online for a month, put a big price tag on this lesson. And the effect of that? Well, I shied away, opting instead to try to learn Hadoop in other people’s environments… which of course took a lot more time.
So Hadoop has a steep learning requirement that is … having an environment to learn with in the first place.
“Well, but Tim, there must be other options out there,” you’re probably saying right now. “What about Cloudera’s QuickStart VM?” you’re asking. Well, Cloudera has ended the QuickStart environment in favor of pushing a “free” trial of their hosted product. There are other options, and some of them can be pretty effective.
Let’s touch back on the cloud-hosted method. There are a vast number of guides that will take you step by step through spinning up a Hadoop cluster on each of the major cloud providers. I will warn you that a lot of those guides are outdated and will have you scratching your head over older or mismatched versions of components. Also, set yourself a reminder: shut that thing down when you’re done with it; your wallet will thank you later.
As for book learning or video training, I’ve always envied people who were able to sit down, read a training manual cover to cover, and absorb all of that knowledge. Myself? I learn better when I’m getting my hands dirty. Video training à la Pluralsight or Lynda does a pretty good job, but usually only gets you so far before sending you off on your own without a working environment to use.
And of course for those of you who are fortunate enough to have a full Cisco UCS chassis sitting in your basement just waiting for another workload to be thrown at it, more power to you folks. For the rest of us, if you have a spare PC lying around with a fair amount of memory (> 8GB), you can manage to cobble together a home lab and there are plenty of guides out there on how to do that.
So, what am I proposing?
Well, Docker to be quite honest. The portability, flexibility and scalability make this option REALLY attractive.
So attractive that I gave putting an environment together the good old college try. Now… this is where I fall on my sword and recall that Thomas Edison quote from earlier. I… didn’t fail per se… but I certainly found at LEAST 10,000 ways to build a Dockerized Hadoop environment incorrectly.
To that end, in my adventures in this space I’ve stumbled across several repositories that I’ve forked, enhanced and utilized to create my own environment.
What I’ve put together is a Docker Compose file that makes it quick and easy to build and provision a Hadoop cluster with Hive AND a multi-node Spark cluster, all of which is open source and ready to be further enhanced by anyone wanting to contribute.
My goal with this environment is to provide like-minded individuals a way to dip their toes into Hadoop at its core.
It’s barebones Hadoop, Hive and Spark. The idea is straight to the point, get data into the environment, add it to HDFS, create a Hive table for that data and get to work. If you choose to do so, you can leave it at that, or you can spin up the Spark cluster and really get your hands dirty with the data.
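That loop — land a file in the cluster, add it to HDFS, put a Hive table on top of it — might look roughly like this from the host. This is an illustrative sketch only: the container names (`namenode`, `hive-server`), file name, and table schema are assumptions, not taken from the repository; check `docker ps` and the README for the real names.

```shell
# Copy a local CSV into the namenode container, then into HDFS
# (container names are hypothetical -- check `docker ps`)
docker cp sales.csv namenode:/tmp/sales.csv
docker exec namenode hdfs dfs -mkdir -p /data/sales
docker exec namenode hdfs dfs -put /tmp/sales.csv /data/sales/

# Point an external Hive table at that directory via beeline
docker exec hive-server beeline -u jdbc:hive2://localhost:10000 -e "
  CREATE EXTERNAL TABLE IF NOT EXISTS sales (
    id INT, region STRING, amount DOUBLE)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION '/data/sales';
  SELECT COUNT(*) FROM sales;"
```

From there, querying the same table from the Spark side is the natural next step once the Spark containers are up.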
When you execute the docker-compose commands you see here, these are the containers that get provisioned. On the Hadoop side you have a namenode and a single datanode. You get a hive-server, a dedicated hive-metastore container, and a postgres container that houses the Hive metastore database. On the Spark side you get a master and two worker nodes. All of this interconnects using Docker’s bridge networking, which also allows your workstation to connect to these components as if they were running on your machine.
Once you’ve mastered the basics here, you can easily jump in and start adding more components, like Pig or Impala, or maybe Ranger.
We’ve covered why I built the environment, but here are a few reasons why I think it could be helpful for others and why I’m sharing it with you all today.
Obviously the most useful thing about this environment is enabling people to Learn Hadoop. And learn it without all the other distractions that enterprise deployments bring with them. I’m looking at you Cloudera.
Development can take place in this environment and I’m comfortable with saying it will get you at least 90% of the way there. You’ll want to spend that last 10% tweaking your code for performance reasons on whatever environment you’re working in.
And lastly maybe you’re assessing whether or not Hadoop is right for your team. With this environment you can rapidly stand up a proof of concept and decide whether Hadoop is right for your datasets or whether or not Spark would be advantageous to you.
The environment is not, let me repeat that, NOT for production purposes. It’s not optimized for performance at all, and that’s on purpose. I think part of the fun of working in this capacity is troubleshooting all the hair-raising events that would come up in a production environment. So the installation is completely default. Throw your workload on it and tweak the performance to your liking.
I don’t know if I made this clear enough before, but to reiterate, this environment is NOT for production. I didn’t have any security standards or best practices in mind when building this. Again, that’s on purpose. If I were to secure everything the way it should be, no one would want to use it. That said, it’s the perfect environment for learning how to implement security policies, so feel free to go nuts. Worst case scenario, you blow away your containers and spin up new ones ready to be broken again.
Getting started is as simple as cloning the GitHub repository and following the instructions posted in the README.
A few warnings or disclaimers. This hasn’t been thoroughly tested on all platforms, yes it’s Docker and as long as you’re running a recent version of that it SHOULD work fine, but I think we all know there’s a big difference between SHOULD work and WILL work.
Also, in the spirit of open source, I want to make it known that I will be actively maintaining this repository. So feel free to throw PRs my way or fork my work and enhance it for your own uses.
That brings me to the end of my presentation. Thank you all for sitting through my babbling, hopefully you found at least some of it useful.
Again, here is my contact information should you have ANY questions at all or want to help participate in the project.
A HUGE thank you to Ivan Ermilov and his team at Big Data Europe. Their work REALLY saved me on this, and I highly recommend you check out what they’ve done at their repositories.