
Localized Hadoop Development

Adam Doyle

Slides from the January 2021 St. Louis Big Data IDEA meeting by Tim Bytnar regarding using Docker containers for a localized Hadoop development cluster.


Localized Hadoop Development

  • 1. Localized Hadoop Development: How to get up and running quickly, by Tim Bytnar
  • 2. Tim Bytnar
      • 17 years in the industry
      • Data Engineering
      • Microsoft Development and Application Stack
      • Systems Automation
      • Datacenter Infrastructure
      • Network Engineering
      • Email: Tim.Bytnar@Daugherty.com
      • LinkedIn: https://www.linkedin.com/in/timbytnar/
      • "I have not failed. I've just found 10,000 ways that won't work." - Thomas A. Edison
  • 3. What is the problem? Hadoop development has a steep requirement of having access to an environment that allows you to freely explore the overwhelming ecosystem
  • 4. Are there other options?
      • CLOUD PROVIDER “FREE” TIME
      • BOOK LEARNING OR VIDEO TRAINING
      • HOME LAB (IF YOU HAVE ONE OF THESE LYING AROUND LIKE I DON’T)
  • 5. What do you propose?
  • 6. Dockerized Hadoop and Spark Environments
  • 7. What the environment is for.
      • Learning Hadoop!
      • Developing…
          • BASH Scripts
          • Hive Automations
          • Spark Processing
          • Data Analysis (Tableau, PowerBI, Jupyter, etc…)
      • Rapid Proof of Concept
          • Will this dataset work in Hadoop?
          • What advantages would Spark give me for this workload?
  • 10. How to get started?
      > git clone https://github.com/tbytnar/docker-hive.git
  • 11. Want any help? The repository is public and open for pull requests or forks
      Future Plans
      • Keep it updated
      • Add more modularity
      • Add walkthroughs and challenges
      • Improve Cross-platform Portability
      • Baseline Performance Optimized Version
  • 13. Tim Bytnar
      • Email: Tim.Bytnar@Daugherty.com
      • LinkedIn: https://www.linkedin.com/in/timbytnar/
      > git clone https://github.com/tbytnar/docker-hive.git
      Thank you to: Ivan Ermilov and his team at Big Data Europe
      • http://github.com/big-data-europe/docker-hadoop
      • http://github.com/big-data-europe/docker-hive
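
The slides give only the clone command, so here is a rough sketch of what a first bring-up might look like when scripted. It is not taken from the repository: it assumes the compose file sits at the top of the checkout and that the `docker-compose` binary is installed (newer Docker installs may use `docker compose` instead). The helper names `getting_started_commands` and `run_all` are made up for illustration, and the repository's README remains the authority on the real steps.

```python
import shlex
import subprocess

# Repository URL as shown on the slides.
REPO_URL = "https://github.com/tbytnar/docker-hive.git"


def getting_started_commands(repo_url=REPO_URL):
    """Return (argv, working_directory) pairs for a first bring-up.

    Assumption: git checks the repo out into a directory named after the
    last URL segment, and the compose file lives at the top of it.
    """
    name = repo_url.rstrip("/").rsplit("/", 1)[-1]
    if name.endswith(".git"):
        name = name[: -len(".git")]
    return [
        (["git", "clone", repo_url], None),   # fetch the project
        (["docker-compose", "up", "-d"], name),  # start containers in the background
        (["docker-compose", "ps"], name),     # list what is running
    ]


def run_all(commands, dry_run=True):
    """Print-friendly runner; only executes when dry_run is False."""
    lines = []
    for argv, cwd in commands:
        printable = " ".join(shlex.quote(arg) for arg in argv)
        prefix = f"[{cwd}] " if cwd else ""
        lines.append(f"{prefix}$ {printable}")
        if not dry_run:
            subprocess.run(argv, cwd=cwd, check=True)
    return lines


if __name__ == "__main__":
    # Dry run by default: shows the commands without touching Docker.
    for line in run_all(getting_started_commands()):
        print(line)
```

The runner defaults to a dry run so the commands can be read before anything is cloned or started; pass `dry_run=False` once they match what the README says.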

Editor's Notes

  1. Thank you for attending today and thank you for giving me your time. Tonight, I’ll be talking a bit about training and developing in Hadoop and particularly the challenges of doing so.
  2. First, that awkward narcissistic slide where I tell you a little about myself. Like many of you, I grew up lovingly addicted to technology, especially computers. Seventeen years ago I finally turned that passion into a career and over that time I’ve gotten my hands into many different verticals. Much of that time has been spent working with data either as a DBA or as an engineer. Paired with that has been a lot of time in the Microsoft stack either developing and supporting software applications or deploying and managing server infrastructure. As most of my career has been spent in managed hosting, I’ve also had quite a bit of experience working with systems automation, monitoring, infrastructure design and implementation and a little dabbling in network engineering. I’ve put my favorite quote there by Thomas Edison. [READ THE QUOTE] You’ll find out why I like that quote so much in a bit.
  3. So, what IS the problem exactly? Well, I should probably start with my story. I got interested in Big Data several years ago when the term became mainstream. I did my typical Google-fu to see what I could learn about the technology and maybe convince my managers to look at implementing it. No dice. It felt like the more I dug the more questions I had. Hadoop, HDFS, YARN, Pig, Sqoop, MapReduce, Spark, Hive, Solr, Lucene, ZooKeeper, Oozie… and I’ve only scratched the surface of the entire ecosystem. By the time I got INTO big data and Hadoop, it was already overwhelming. Alright fine, I’ll knuckle down and get a private environment set up for myself so I can start learning this behemoth. At the time, most of the guides I followed directed me to the cloud providers… which I followed… and a bill for several hundred dollars, after I forgot that I had left a cluster online for a month, put a big price tag on this lesson. And the effect of that? Well, I shied away, opting instead to try to learn Hadoop in other people’s environments… which of course took a lot more time. So Hadoop has a steep learning requirement, which is… having an environment to learn with in the first place.
  4. “Well, but Tim, there must be other options out there,” you’re probably saying right now. “What about Cloudera’s Quickstart VM?” you’re asking. Well, Cloudera has ended the Quickstart environment in favor of pushing their “free” trial of a hosted product. There are other options and some of them can be pretty effective. Let’s touch back on the Cloud Hosted method. There is a vast number of guides that will take you step-by-step through spinning up a Hadoop cluster in each of the major Cloud Providers. I will warn you that a lot of those guides are outdated and will have you scratching your head with older or mismatched versions of components. Also, set yourself a reminder. Shut that thing down when you’re done with it; your wallet will thank you later. As for Book learning or Video Training, I’ve always envied people who were able to sit down and read a training manual cover to cover and absorb all of that knowledge. Myself? I learn better when I’m getting my hands dirty. Video training à la Pluralsight or Lynda does a pretty good job, but usually only gets you so far before sending you off on your own without a working environment to use. And of course for those of you who are fortunate enough to have a full Cisco UCS chassis sitting in your basement just waiting for another workload to be thrown at it, more power to you folks. For the rest of us, if you have a spare PC lying around with a fair amount of memory (more than 8 GB), you can manage to cobble together a home lab and there are plenty of guides out there on how to do that.
  5. So, what am I proposing? Well, Docker, to be quite honest. The portability, flexibility and scalability make this option REALLY attractive. So attractive that I gave it a good college try at putting an environment together. Now… this is where I fall on my sword and recall that quote from Thomas Edison earlier. I… didn’t fail per se… but I certainly found at LEAST 10,000 ways to build a Dockerized Hadoop environment incorrectly. To that end, in my adventures in this space I’ve stumbled across several repositories that I’ve forked, enhanced and utilized to create my own environment. What I’ve put together is a Docker-Compose file that makes it quick and easy to build and provision a Hadoop cluster with Hive AND a multi-node Spark cluster, all of which is open source and ready to be further enhanced by anyone wanting to contribute.
  6. My goal with this environment is to provide like-minded individuals a way to dip their toes into Hadoop at its core. It’s barebones Hadoop, Hive and Spark. The idea is straight to the point: get data into the environment, add it to HDFS, create a Hive table for that data and get to work. If you choose to do so, you can leave it at that, or you can spin up the Spark cluster and really get your hands dirty with the data. When you execute the docker-compose commands you see here, these are the containers that get provisioned. On the Hadoop side you have a namenode and a single datanode. You get a hive-server, a dedicated hive-metastore container and a postgres container that houses the hive metastore database. On the Spark side you get a Master and two Worker nodes. All of this can interconnect using Docker’s bridge networking, which also allows your workstation to connect to these components as if they were running on your machine. Once you’ve mastered the basics here you can easily jump in and start adding more components like Pig or Impala or Ranger maybe.
  7. We’ve covered why I built the environment but here’s a few reasons why I think it could be helpful for others and why I’m sharing it with you all today. Obviously the most useful thing about this environment is enabling people to Learn Hadoop. And learn it without all the other distractions that enterprise deployments bring with them. I’m looking at you Cloudera. Development can take place in this environment and I’m comfortable with saying it will get you at least 90% of the way there. You’ll want to spend that last 10% tweaking your code for performance reasons on whatever environment you’re working in. And lastly maybe you’re assessing whether or not Hadoop is right for your team. With this environment you can rapidly stand up a proof of concept and decide whether Hadoop is right for your datasets or whether or not Spark would be advantageous to you.
  8. The environment is not, let me repeat that, NOT for production purposes. It’s not optimized for performance at all and that’s on purpose. I think part of the fun of working at this capacity is troubleshooting all the hair-raising events that would come up in a production environment. So the installation is completely default. Throw your workload on it and tweak the performance to your liking. I don’t know if I made this clear enough before but to reiterate, this environment is NOT for production. I kept no security standards or best practices in mind when building this. Again, that’s on purpose. If I were to secure everything the way it should be, no one would want to use it. That said, it’s the perfect environment for learning how to implement security policies, so feel free to go nuts. Worst case scenario you blow away your containers and spin up new ones ready to be broken again.
  9. Getting started is as simple as cloning the GitHub repository and following the instructions posted in the README. A few warnings or disclaimers. This hasn’t been thoroughly tested on all platforms, yes it’s Docker and as long as you’re running a recent version of that it SHOULD work fine, but I think we all know there’s a big difference between SHOULD work and WILL work. Also, in the spirit of open source, I want to make it known that I will be actively maintaining this repository. So feel free to throw PRs my way or fork my work and enhance it for your own uses.
  10. That brings me to the end of my presentation. Thank you all for sitting through my babbling; hopefully you found at least some of it useful. Again, here is my contact information should you have ANY questions at all or want to help participate in the project. A HUGE thank you to Ivan Ermilov and his team at Big Data Europe. Their work REALLY saved me on this, and I highly recommend you check out what they’ve done at their repositories.
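
Note 6 describes the basic loop: get data into the environment, add it to HDFS, create a Hive table. As a rough illustration only, the sketch below builds the commands such a loop might use. Several things here are assumptions rather than facts about the repository: that the containers are literally named `namenode` and `hive-server`, that `hdfs` and `beeline` are on the containers' PATH, that HiveServer2 listens on port 10000 inside the container, and that `/data/<table>` is an acceptable HDFS location. The function names are hypothetical.

```python
import posixpath


def hive_table_ddl(table, columns, hdfs_dir, delimiter=","):
    """Build a CREATE EXTERNAL TABLE statement over delimited text files."""
    cols = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} ({cols}) "
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}' "
        f"STORED AS TEXTFILE LOCATION '{hdfs_dir}'"
    )


def load_steps(local_file, table, columns,
               namenode="namenode", hive="hive-server"):
    """Return argv lists for: copy file in, put it in HDFS, create the table.

    Container names are assumptions; check `docker-compose ps` for the
    real ones before running anything.
    """
    filename = posixpath.basename(local_file)
    staged = f"/tmp/{filename}"      # staging path inside the namenode container
    hdfs_dir = f"/data/{table}"      # hypothetical HDFS directory for the table
    return [
        # 1. Copy the file from the workstation into the container.
        ["docker", "cp", local_file, f"{namenode}:{staged}"],
        # 2. Create the target directory in HDFS.
        ["docker", "exec", namenode, "hdfs", "dfs", "-mkdir", "-p", hdfs_dir],
        # 3. Upload the staged file, overwriting any earlier copy.
        ["docker", "exec", namenode, "hdfs", "dfs", "-put", "-f", staged, hdfs_dir],
        # 4. Define an external Hive table over that directory.
        ["docker", "exec", hive, "beeline",
         "-u", "jdbc:hive2://localhost:10000",
         "-e", hive_table_ddl(table, columns, hdfs_dir)],
    ]


if __name__ == "__main__":
    schema = [("id", "INT"), ("city", "STRING")]
    for argv in load_steps("./trips.csv", "trips", schema):
        print(" ".join(argv))
```

An external table is used so that dropping the table in Hive leaves the files in HDFS alone, which suits an environment meant to be broken and rebuilt. The `trips.csv` file and its two-column schema are placeholders.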