ianmas@amazon.com
@IanMmmm
LARGE SCALE DATA
ANALYSIS WITH AWS



Ian Massingham – Technical Evangelist
THE MORE DATA YOU COLLECT
THE MORE VALUE YOU CAN
DERIVE FROM IT!
THE COST OF DATA
GENERATION IS FALLING!
We are constantly producing more data
From all types of industries
GENERATE ➔ STORE ➔ ANALYZE ➔ SHARE!
GENERATE ➔ STORE ➔ ANALYZE ➔ SHARE!
Lower cost,
higher throughput
GENERATE ➔ STORE ➔ ANALYZE ➔ SHARE!
Lower cost,
higher throughput
Highly
constrained
+ ELASTIC AND HIGHLY SCALABLE
+ NO UPFRONT CAPITAL EXPENSE
+ ONLY PAY FOR WHAT YOU USE
+ AVAILABLE ON-DEMAND
= REMOVE CONSTRAINTS
GENERATE ➔ STORE ➔ ANALYZE ➔ SHARE!
GENERATE ➔ STORE ➔ ANALYZE ➔ SHARE!
AWS Import / Export
AWS Direct Connect
Inbound data transfer is free
Multipart upload to S3
Physical media
AWS Direct Connect
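
A minimal sketch of how a large object might be divided for a multipart upload to S3. The helper name `plan_parts` and the 8 MiB default part size are illustrative assumptions, not something shown in the talk. The limits it encodes (at most 10,000 parts, and every part except the last at least 5 MiB) are the documented S3 multipart limits. It only computes byte ranges; the upload calls themselves are left out.

```python
MIB = 1024 * 1024
MIN_PART_SIZE = 5 * MIB   # S3 minimum for every part except the last
MAX_PARTS = 10_000        # S3 maximum number of parts per upload


def plan_parts(total_size, part_size=8 * MIB):
    """Return (part_number, offset, length) tuples covering total_size bytes.

    Hypothetical helper for illustration only.
    """
    if total_size <= 0:
        raise ValueError("total_size must be positive")
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size must be at least 5 MiB")
    parts = []
    offset = 0
    number = 1  # S3 part numbers start at 1
    while offset < total_size:
        length = min(part_size, total_size - offset)
        parts.append((number, offset, length))
        offset += length
        number += 1
    if len(parts) > MAX_PARTS:
        raise ValueError("too many parts; choose a larger part_size")
    return parts


parts = plan_parts(20 * MIB)
```

Each part can be uploaded independently and in parallel, and a failed part can be retried on its own.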
GENERATE ➔ STORE ➔ ANALYZE ➔ SHARE!
Amazon S3,
Amazon Glacier,
Amazon DynamoDB,
Amazon RDS,
Amazon Redshift,
AWS Storage Gateway,
Data on Amazon EC2
GENERATE ➔ STORE ➔ ANALYZE ➔ SHARE!
Amazon EC2
Amazon Elastic
MapReduce
AMAZON ELASTIC
MAPREDUCE

HADOOP AS A SERVICE!
•  SPLITS DATA INTO PIECES
•  LETS PROCESSING OCCUR
•  GATHERS THE RESULTS!
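
The three bullets above can be sketched in a few lines of plain Python. This is only an illustration of the split / process / gather pattern, not Hadoop itself: the function names and the sample values are made up, and the pieces are processed one after another in a single process, whereas a cluster would process them on different nodes.

```python
from collections import Counter
from functools import reduce


def split_into_pieces(records, n_pieces):
    """Split a list of records into n_pieces roughly equal chunks."""
    return [records[i::n_pieces] for i in range(n_pieces)]


def count_by_key(piece):
    """'Map' step: count occurrences of each record within one piece."""
    return Counter(piece)


def gather(partials):
    """'Reduce' step: merge the per-piece counts into one result."""
    return reduce(lambda a, b: a + b, partials, Counter())


records = ["T5", "T3", "T5", "T2", "T3", "T5"]  # made-up sample values
pieces = split_into_pieces(records, 3)
totals = gather(count_by_key(p) for p in pieces)
```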
[Diagram labels: HDFS, EMR, Kinesis, S3, DynamoDB, RDS, Redshift, AWS Data Pipeline, Pig. Groupings: Data management; Analytics languages/engines]
EMR + IMPALA DEMO
STARTING AN EMR CLUSTER
WITH HADOOP ECOSYSTEM
TOOLS PRE-INSTALLED
COPY & LOAD OUR DATASET
$ scp -i EMRKeyPair.pem ~/aws/hadoop/LHRarrivals*.csv hadoop@ec2-54-76-242-238.eu-west-1.compute.amazonaws.com:

$ ssh -i EMRKeyPair.pem hadoop@ec2-54-76-242-238.eu-west-1.compute.amazonaws.com

$ hadoop fs -mkdir /data/
$ hadoop fs -put <uploaded_files> /data/
$ hadoop fs -ls -h -R /data/

Or, at scale, use Distributed Copy (S3DistCp) to load from S3 in parallel:

$ . /home/hadoop/impala/conf/impala.conf
$ hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar -Dmapreduce.job.reduces=30 --src s3://s3bucketname/ --dest hdfs://$HADOOP_NAMENODE_HOST:$HADOOP_NAMENODE_PORT/data/ --outputCodec 'none'

** Run on a cluster master node
CREATE EXTERNAL TABLE
$ # check the size of our data set
$ wc -l LHRarrivals*.csv
   850 LHRarrivals2.csv
  1526 LHRarrivals.csv
  2376 total

$ impala-shell

Welcome to the Impala shell.

> create EXTERNAL TABLE flights (
    input STRING, id BIGINT, widget STRING, source STRING,
    resultnum BIGINT, pageurl STRING, scheduled STRING, flightnumber STRING,
    airport STRING, status STRING, terminal STRING
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION '/data/';
> select count(*) from flights;

Should return count(*) 2376, reflecting the size of the data set
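
To make the table definition concrete, here is a rough Python sketch of how one comma-delimited line maps onto the eleven declared columns. It is plain Python, not Impala, and it only mirrors the column order and the STRING/BIGINT types from the statement above. The two sample rows are made up for illustration and are not taken from the real data set.

```python
from collections import namedtuple

# Column names and types copied from the CREATE EXTERNAL TABLE statement.
COLUMNS = [
    ("input", str), ("id", int), ("widget", str), ("source", str),
    ("resultnum", int), ("pageurl", str), ("scheduled", str),
    ("flightnumber", str), ("airport", str), ("status", str),
    ("terminal", str),
]
Flight = namedtuple("Flight", [name for name, _ in COLUMNS])


def parse_line(line):
    """Split one comma-delimited line into a Flight, casting BIGINT columns to int."""
    fields = line.rstrip("\n").split(",")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    return Flight(*(cast(value) for (_, cast), value in zip(COLUMNS, fields)))


# Made-up sample rows for illustration; not taken from the real data set.
sample = [
    "in1,1,w1,src,1,http://example.com/a,10:05,BA123,JFK,LANDED,5\n",
    "in1,2,w1,src,2,http://example.com/a,10:20,VS004,LAX,EXPECTED,3\n",
]
flights = [parse_line(line) for line in sample]
row_count = len(flights)  # one row per input line, as with count(*)
```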
  
DEMO OF ODBC ACCESS
This part of the demo runs on Amazon WorkSpaces, using the Simba Cloudera Impala ODBC Driver.
An SSH tunnel to the master node lets the WorkSpaces desktop connect, via port 25010, to the Impala ODBC port.
A previously configured system DSN then lets us work with the data from our EMR/Impala cluster directly within Microsoft Excel.
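
As a rough illustration of the tunnel step, the sketch below builds the argument list for an SSH local port forward. It assumes an OpenSSH-style client and reuses the key file and host name from the earlier slides. The helper name is hypothetical, and because the slide mentions only port 25010, using that number on both ends of the tunnel is an assumption.

```python
def build_tunnel_command(key_file, host, local_port, remote_port, user="hadoop"):
    """Return the argv list for an SSH local port forward to the master node.

    -N: do not run a remote command, only forward ports.
    -L: forward local_port on this machine to remote_port on the master node.
    """
    return [
        "ssh", "-i", key_file, "-N",
        "-L", f"{local_port}:localhost:{remote_port}",
        f"{user}@{host}",
    ]


command = build_tunnel_command(
    "EMRKeyPair.pem",
    "ec2-54-76-242-238.eu-west-1.compute.amazonaws.com",
    local_port=25010,   # assumption: same port number on both ends
    remote_port=25010,
)
print(" ".join(command))
```

With the tunnel open, an ODBC DSN on the desktop would point at localhost and the local port rather than at the cluster directly.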
GENERATE ➔ STORE ➔ ANALYZE ➔ SHARE!
Amazon S3,
Amazon DynamoDB,
Amazon RDS,
Amazon Redshift,
Data on Amazon EC2
GENERATE ➔ STORE ➔ ANALYZE ➔ SHARE!
GENERATE ➔ STORE ➔ ANALYZE ➔ SHARE!
BATCH
PROCESSING
GENERATE ➔ ➔ SHARE!
STREAM
PROCESSING
AMAZON KINESIS

REAL-TIME DATA STREAM PROCESSING!
Real-time response to content
in semi-structured data streams



Relatively simple computations
on data (aggregates, filters,
sliding window, etc.)
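
A minimal sketch of the kind of computation meant here: a filter followed by a sliding-window sum over a time-ordered stream. It is plain Python over an in-memory list, not the Kinesis API, and the function name and sample events are made up for illustration.

```python
from collections import deque


def sliding_window_sums(events, window_seconds, predicate=lambda value: True):
    """Yield (timestamp, running_sum) for each event that passes the filter.

    events: iterable of (timestamp_seconds, value) pairs in time order.
    The running sum covers accepted events whose timestamp is strictly
    later than (current timestamp - window_seconds).
    """
    window = deque()
    total = 0
    for timestamp, value in events:
        if not predicate(value):
            continue  # filter step
        window.append((timestamp, value))
        total += value
        # Evict events that have fallen out of the window.
        while window and window[0][0] <= timestamp - window_seconds:
            _, old_value = window.popleft()
            total -= old_value
        yield timestamp, total


# Made-up sample stream: (seconds, value)
stream = [(0, 5), (10, 3), (20, -1), (70, 4), (75, 2)]
results = list(sliding_window_sums(stream, 60, predicate=lambda v: v > 0))
```

Because each event is added once and evicted once, the work per event stays small even on a long-running stream.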
Hourly server logs: how your
systems went wrong an hour ago
Weekly / Monthly Bill: What you
spent this past billing cycle
Daily customer report from your
website: tells you what deal or ad
to try next time
Daily fraud reports: tells you if there
was fraud yesterday
Daily business reports: tells me
how customers used AWS services
yesterday
Real-time metrics: what just went
wrong now
Real-time spending alerts/caps:
guaranteeing you can’t overspend
Real-time analysis: what to offer
the current customer now
Real-time detection: blocks
fraudulent use now
Fast ETL into Amazon Redshift:
how are customers using services
now
GENERATE ➔ STORE ➔ ANALYZE ➔ SHARE!
GENERATE ➔ STORE ➔ ANALYZE ➔ SHARE!
Amazon S3,
Amazon DynamoDB,
Amazon RDS,
Amazon Redshift,
Data on Amazon EC2
Amazon EC2
Amazon Elastic
MapReduce
Amazon S3,
Amazon Glacier,
Amazon DynamoDB,
Amazon RDS,
Amazon Redshift,
AWS Storage Gateway,
Data on Amazon EC2
AWS Import / Export
AWS Direct Connect
GENERATE ➔ ➔ SHARE!
STREAM
PROCESSING
GENERATE ➔ ➔ SHARE!
STREAM
PROCESSING
Amazon S3,
Amazon DynamoDB,
Amazon RDS,
Amazon Redshift,
Data on Amazon EC2
Amazon Kinesis
Stream Processing on
Amazon EC2
WANT TO KNOW MORE?
aws.amazon.com/solutions/case-studies/big-data/
ianmas@amazon.com
@IanMmmm
LARGE SCALE DATA
ANALYSIS WITH AWS



Ian Massingham – Technical Evangelist

2014 Import.io Data Summit - Including Hadoop/Impala Getting Started Demo
