The Leader in Big Data Consulting
www.mammothdata.com | @mammothdataco
Everything You Need To Know About Hadoop Now
{Percona University | Raleigh}
PaaKow Acquah, Lead Consultant
Joined OSI January 2008
Currently leading a Data Lake design and implementation for a large Californian utility company
Open Software Integrators
Open Software Integrators is a Big Data consulting and services company specializing in Hadoop,
Cassandra, MongoDB and other NoSQL technologies. OSI focuses on executive strategy, initial
install, design and implementation.
Founded January 2008 by Andrew C. Oliver
Based in downtown Durham, NC
Partnered with Hortonworks, MongoDB, DataStax, Cloudera, Couchbase, Cloudbees & Neo Technology
Overview
What is Hadoop anyhow?
What is Hadoop Good For?
What isn’t it good for?
How do you get data into Hadoop?
How do you get data out of Hadoop?
How do you process data in Hadoop?
How do you analyze data in Hadoop?
How do you secure Hadoop?
But first...
This is an overview talk intended as a roadmap to point you at the most important bits to learn on the way…
It is not comprehensive training…
It is not an in-depth look at any part of Hadoop
It is a rather high-level, selective overview of the Hadoop ecosystem
What Is Hadoop Anyhow?
Hadoop is...
A platform for distributed computing
2011: HDFS, Hive
2012: HDFS, YARN, Hive, HBase
2014: HDFS, Hive, YARN, HBase, Spark, Storm, Kafka, Mahout, Sqoop, Oozie, ...
Hadoop is...
HDFS
Distributed filesystem similar to Gluster, Ceph, etc.
You can use other distributed filesystems in place of HDFS
Blocks are distributed across nodes and, by default, each block is stored on three nodes (replication factor 3)
128 MB default block size
RESTful API, CLI tools, and third-party tools to “mount” HDFS on Linux (stable), Windows (ymmv), Mac (?)
DO NOT PUT YOUR DATA NODES ON A SAN! IT IS WRONG! DO NOT DO IT! EVEN ON THURSDAY!
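To make the block size and replication numbers concrete, here is a rough back-of-the-envelope sketch. It assumes the 128 MB block size from the slide and a replication factor of 3 (the stock HDFS default, which your cluster may override). The function name is made up for illustration.

```python
import math

BLOCK_SIZE_MB = 128   # default block size from the slide
REPLICATION = 3       # assumption: stock HDFS default, configurable per cluster

def hdfs_footprint(file_size_mb):
    """Return (number of blocks, raw MB stored across the cluster) for one file."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    raw_mb = file_size_mb * REPLICATION   # every block is stored REPLICATION times
    return blocks, raw_mb

blocks, raw_mb = hdfs_footprint(1000)
print(f"{blocks} blocks, {raw_mb} MB of raw disk")
```

The point of the arithmetic: usable capacity is roughly raw capacity divided by the replication factor, so size your disks accordingly.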
Hadoop is...
YARN
Yet Another Resource Negotiator
Schedules “work” among nodes, distributes the “processing”
MapReduce is
an API
an algorithm: data is mapped to nodes, each node processes its share, and the partial answers are “reduced” to a single answer
Hive is
HDFS/Hadoop based data warehousing
SQL, JDBC, ODBC
Tables map to files on HDFS
No updates, deletes, transactions (but coming in “Stinger.next”)
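The map and reduce steps described above can be sketched in a few lines of plain Python. This is not the Hadoop MapReduce Java API, just a single-process toy that shows the shape of the algorithm using the classic word count. All function names are made up for illustration.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group values by key, which the framework does between the phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single answer."""
    return key, sum(values)

def word_count(lines):
    """Run map, shuffle, and reduce in sequence over a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

print(word_count(["Hadoop is HDFS", "Hadoop is YARN"]))
```

On a real cluster the map calls run in parallel on the nodes that hold the data, and the shuffle moves the grouped pairs across the network to the reducers.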
Hadoop is...
HBase
A column-family database
ACID (at the row level)
Relatively low latency
And a whole lot more
Hadoop is...
An ecosystem of tools for distributed processing and storage of data.
What Is Hadoop Good For?
Working with large amounts of data in batch
ETL processing / Data Transformation
Analytics / BI
Integration (Data Lake, Enterprise Data Hub)
Working with streams of data
Events
Log data
Time series or similar data (HBase)
What Is Hadoop Bad At?
Quick jobs: Hive/MapReduce setup time is measured in seconds to minutes.
Lots of small files: every file and block is tracked in NameNode memory, so millions of files far smaller than the 128 MB block size strain the NameNode
General DBMS stuff: HBase is a much more “specific” database than MySQL/etc.
High Availability
WHA???
Knox, Oozie, etc. all have shaky support, if any, for HA NameNodes.
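One way to see why lots of small files hurt is to estimate the NameNode metadata they generate. The sketch below assumes the commonly quoted rule of thumb of roughly 150 bytes of NameNode heap per file entry and per block entry. Treat that number and the function name as assumptions, not measurements.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB default block size
BYTES_PER_OBJECT = 150           # assumption: rule-of-thumb heap cost per file or block entry

def namenode_bytes(total_bytes, file_size_bytes):
    """Estimate NameNode heap needed to track total_bytes stored as equal-sized files."""
    files = math.ceil(total_bytes / file_size_bytes)
    blocks_per_file = max(1, math.ceil(file_size_bytes / BLOCK_SIZE))
    objects = files * (1 + blocks_per_file)   # one file entry plus its block entries
    return objects * BYTES_PER_OBJECT

one_tb = 1024 ** 4
small = namenode_bytes(one_tb, 1024 * 1024)          # 1 TB stored as 1 MB files
large = namenode_bytes(one_tb, 1024 * 1024 * 1024)   # 1 TB stored as 1 GB files
print(f"1 MB files: {small:,} bytes of heap; 1 GB files: {large:,} bytes of heap")
```

Same data, wildly different bookkeeping cost, which is why people compact small files before (or shortly after) landing them on HDFS.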
How Do You Get Data Into/Out Of Hadoop?
How Do You Get Data Into Hadoop?
Sqoop it from an RDBMS
Use JDBC or ODBC and push into Hive from an external DB
Push data into Hive with the RESTful API
Put an extract file onto HDFS with the REST API, then:
process it into Hive directly with a LOAD DATA statement
transform/process it into Hive using Pig
use Java
Message it in there with Kafka, RabbitMQ or a similar MQ and a custom “spout” for Storm
Use any of a multitude of APIs that write data into HDFS, HBase, Hive, etc.
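For the "put an extract file onto HDFS with the REST API" option, the HDFS REST interface is WebHDFS. Creating a file is a two-step dance: an HTTP PUT with op=CREATE to the NameNode, which answers with a redirect to a DataNode, followed by a second PUT of the file contents to that location. The sketch below only builds the first-step URL. The host name, port, path, and user are placeholders, and the helper name is made up for illustration.

```python
from urllib.parse import quote, urlencode

def webhdfs_create_url(host, port, hdfs_path, user, overwrite=False):
    """Build the first-step WebHDFS URL for creating a file (to be sent as an HTTP PUT)."""
    query = urlencode({
        "op": "CREATE",
        "user.name": user,                    # simple (non-Kerberos) authentication
        "overwrite": str(overwrite).lower(),
    })
    return f"http://{host}:{port}/webhdfs/v1{quote(hdfs_path)}?{query}"

# Placeholder values: substitute your own NameNode address and target path.
print(webhdfs_create_url("namenode.example.com", 50070, "/data/in/orders.csv", "etl"))
```

On a Kerberized cluster the user.name parameter does not apply and you would authenticate with SPNEGO instead.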
How Do You Get Data Out Of Hadoop?
Should you be getting it out or should you process it there?
JDBC/ODBC to Hive
HBase can be mounted into Hive
REST APIs for Hive/HDFS
APIs for Kafka, Spark, Storm, etc. (subscribe)
DistCp (hadoop distcp) to another HDFS cluster
Mount it with FUSE and use your favorite Linux tool
hadoop fs -cat /path/to/file/on/hdfs | grep stuff > mynewlocalfile
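If you would rather drive the cat-and-grep pipeline from a script, a minimal sketch looks like the following. It assumes the hadoop CLI is on the PATH of the machine running it, and the function names are made up for illustration. Note that it buffers the whole file in memory, so it only suits modest files.

```python
import subprocess

def grep_lines(lines, needle):
    """Keep only the lines that contain needle, like a fixed-string grep."""
    return [line for line in lines if needle in line]

def hdfs_grep(hdfs_path, needle, local_path):
    """Read a file with `hadoop fs -cat` and save the matching lines to a local file."""
    result = subprocess.run(
        ["hadoop", "fs", "-cat", hdfs_path],
        capture_output=True, text=True, check=True,
    )
    matches = grep_lines(result.stdout.splitlines(keepends=True), needle)
    with open(local_path, "w") as out:
        out.writelines(matches)
    return len(matches)
```

Calling hdfs_grep("/path/to/file/on/hdfs", "stuff", "mynewlocalfile") would then do the same job as the one-liner.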
How Do You Process Data In Hadoop?
MapReduce Java API
Hive supports SQL (a subset for now, but not for long)
Pig can munge files on HDFS and can work with Hive
Storm and Spark have their own APIs for dealing with events or so-called micro-batches of data
There are numerous toolkits
Mahout - common machine learning algorithms (many not very parallelizable, etc.)
MLlib - machine learning built on Spark
GraphX - graph processing built on Spark
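The "micro-batch" idea is easy to picture: instead of handling each event the moment it arrives, you collect whatever arrived in a short time window and process that little batch in one go. The sketch below is plain Python, not the Spark or Storm API, and the function name and sample events are made up for illustration.

```python
from collections import defaultdict

def micro_batches(events, interval_seconds):
    """Group (timestamp, payload) events into fixed-width time windows, oldest first."""
    batches = defaultdict(list)
    for timestamp, payload in events:
        # Every event in the same window shares the same window start time.
        window_start = int(timestamp // interval_seconds) * interval_seconds
        batches[window_start].append(payload)
    return [batches[start] for start in sorted(batches)]

events = [(0.2, "login"), (0.9, "click"), (1.4, "click"), (3.1, "logout")]
for batch in micro_batches(events, interval_seconds=1):
    print(batch)
```

A shorter interval gets you closer to true event-at-a-time processing at the cost of more scheduling overhead per batch.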
How Do You Analyze Data In Hadoop?
Most major BI tools now support Hadoop
Tableau
Pentaho
Datameer
Your favorite probably here
All that stuff is for l4m3rs, use the command line interface :-)
hive -e 'select * from sometable'
pig hdfs://some/dir/myscript.pig
Use RStudio and write some R to predict what sales will be next month (you will probably be sort of wrong)
Use your favorite SQL tool that supports JDBC/ODBC
Use Hue
How Do You Secure Hadoop?
HDFS supports POSIX (that means Linux-style) filesystem security
The most complete authentication throughout Hadoop is based on Kerberos (yeah I know).
You can do it with just straight LDAP too, but it isn’t integrated.
Knox supplies “perimeter-based security” for (only):
Hive
HDFS
Oozie
HBase
HCatalog
Supposedly Argus will save us from all of this!
Other Considerations
Cacophony
Disaster Recovery: Falcon (alpha quality)
Workflow: Flume
Schedule/trigger/orchestrate those ETL jobs: Oozie
Install, configure, monitor Hadoop: Ambari
Use tables in both Pig and Hive: HCatalog
Ambari
Hue
Hue Editing Oozie
Pig Script
REGISTER file:///usr/lib/pig/piggybank.jar;
-- Short alias for the piggybank SUBSTRING function
define SUBSTRING org.apache.pig.piggybank.evaluation.string.SUBSTRING();
-- Load the extract as text columns, delimited by the \u001a control character
rows = load '$FILEPATH' using org.apache.pig.piggybank.storage.CSVExcelStorage('\u001a') as (
    a0:chararray,
    a1:chararray,
    a2:chararray,
    a3:chararray,
    a4:chararray,
    a5:chararray,
    a6:chararray,
    a7:chararray,
    a8:chararray,
-- Trim each column and turn the literal string NULL into an empty string
row = foreach rows GENERATE
    REPLACE((TRIM($0)),'NULL','') as orderid,
    REPLACE((TRIM($1)),'NULL','') as customerid,
    REPLACE((TRIM($2)),'NULL','') as customername,
    REPLACE((TRIM($3)),'NULL','') as address,
    REPLACE((TRIM($4)),'NULL','') as city,
    REPLACE((TRIM($5)),'NULL','') as state,
    REPLACE((TRIM($6)),'NULL','') as zip,
    REPLACE((TRIM($7)),'NULL','') as status,
    REPLACE((TRIM($8)),'NULL','') as
Thank you for attending!
