EXTRACTING INSIGHTS FROM BIG DATA: A CASE OF NEW YORK CITY YELLOW TAXI DATASET
Parag Ahire
January 11, 2020
PRESENTATION OUTLINE
Brief introduction
• Big Data & high performance computing
• Describe a few techniques for high performance computing
 Compare and contrast a few techniques
Sample Dataset introduction – NYC Yellow Taxi
Apply four techniques
• High-level code review
• Demonstration on Hortonworks virtual machine on Azure
BIG DATA
 Data is growing
• Digital Age : 2002 onwards
• 2019 – 1770 Exabytes
• 2020 – 2000 Exabytes
 What is it ?
• A data set that cannot be processed by a “normal” machine in a “reasonable” amount of time
• Three V’s
 Volume
 Velocity
 Variety
• May vary by time and prevalent technology
 Used to be gigabytes/terabytes
 Now petabytes/exabytes
 Future zettabytes/yottabytes
o Zettabyte – 1000 data centers occupying 20% of Manhattan
o Yottabyte – 1M data centers occupying Delaware and Rhode Island
HIGH PERFORMANCE COMPUTING
 The ability to process massive amounts of data and perform complex calculations at high speed
 New Challenges (7 V’s)
 Previously - Volume
 Now – Velocity, Variety, Variability, Veracity, Visualization, Value
 How to perform ?
• Supercomputers – expensive, require specialized expertise to use, and solve specialized problems
• Clusters of small or medium-sized business computers
• Modern “supercomputers” are mostly “clusters of computers”
PARALLEL AND DISTRIBUTED COMPUTING
Parallel Computing – All processors have access to shared memory
Distributed Computing – Each processor has its own memory. Information is exchanged by passing messages between processors
Images taken from : Wikipedia
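The contrast between the two models can be sketched in a few lines of Python. This is only an emulation inside a single process: threads stand in for processors, a lock-guarded dictionary stands in for shared memory, and a `queue.Queue` stands in for the network. The function names and the sample `chunks` are illustrative assumptions, not part of the presentation.

```python
import queue
import threading

def run_workers(worker, chunks):
    """Start one thread per chunk and wait for all of them to finish."""
    threads = [threading.Thread(target=worker, args=(chunk,)) for chunk in chunks]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

def shared_memory_sum(chunks):
    """Shared-memory style: every worker updates the same total."""
    state = {"total": 0}
    lock = threading.Lock()

    def worker(chunk):
        partial = sum(chunk)
        with lock:                      # shared state must be synchronized
            state["total"] += partial

    run_workers(worker, chunks)
    return state["total"]

def message_passing_sum(chunks):
    """Message-passing style: workers keep private state and send results."""
    inbox = queue.Queue()

    def worker(chunk):
        inbox.put(sum(chunk))           # the only interaction is a message

    run_workers(worker, chunks)
    return sum(inbox.get() for _ in chunks)

chunks = [[1, 2, 3], [4, 5], [6]]
print(shared_memory_sum(chunks), message_passing_sum(chunks))
```

In a real distributed system the queue would be replaced by network messages between machines, which is where the cost of communication becomes the dominant design concern.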
DISTRIBUTED COMPUTING MODELS
 Parallel algorithms
 Shared-memory model
• All processors access shared memory
• Programmer decides what program is executed by each processor
 Message-passing model
• Programmer chooses
o Network structure
o Program executed by each computer
 Distributed algorithms
 Programmer chooses the computer program
 All computers run the same program
HIGH PERFORMANCE COMPUTING TECHNIQUES (HPCT)
 Map Reduce
 A framework or programming model
 Suitable for processing large volumes of structured/unstructured data
 Pig
 Procedural rather than declarative coding approach
 Provides a high degree of abstraction for map reduce
 Hive
 A traditional data warehouse interface for map reduce
 Spark
 An open source big data framework
 A unified analytics engine for large-scale data processing
MAP REDUCE
 Map function
 Input – A key value pair
 Output – A list of key value pairs (zero or more elements)
• (k1, v1) -> list(k2, v2)
 Reduce function
 Input – A key and a list of values
 Output – A list of values
• (k2, list(v2)) -> list(v2)
 Sort
 Merging and sorting of output produced in the map phase
 Shuffle
 Transfers intermediate output of map phase to reducer
 Passes on intermediate output of one or more keys to a single reducer
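The flow of a job through map, shuffle/sort and reduce can be sketched in plain Python. This is a single-process illustration of the programming model, not Hadoop's API: the word-count task, the function names and the sample `records` are assumptions made for the example.

```python
from collections import defaultdict

def map_fn(key, value):
    """Map: (k1, v1) -> list of (k2, v2). Here k1 is a line number, v1 a line of text."""
    return [(word.lower(), 1) for word in value.split()]

def shuffle_and_sort(mapped_pairs):
    """Group intermediate values by key and order the keys, as the framework would."""
    groups = defaultdict(list)
    for k2, v2 in mapped_pairs:
        groups[k2].append(v2)
    return sorted(groups.items())

def reduce_fn(key, values):
    """Reduce: (k2, list(v2)) -> list(v2). Here the values are summed."""
    return [sum(values)]

def run_job(records):
    """Run map over every record, then shuffle/sort, then reduce each key."""
    mapped = [pair for k1, v1 in records for pair in map_fn(k1, v1)]
    return {k2: reduce_fn(k2, values) for k2, values in shuffle_and_sort(mapped)}

records = [(0, "yellow taxi"), (1, "taxi ride"), (2, "Yellow cab")]
result = run_job(records)
print(result)
```

In a cluster, each call to `map_fn` and `reduce_fn` may run on a different machine; only the shuffle step moves data between them.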
MAP REDUCE
 Concerns
 Map phase – done in parallel, typically 20% of the work
 Reduce phase – executed sequentially for each key, typically 80% of the work
 Tips
 Increase the work done in the map phase and leave less for the reduce phase
 Include the optional combine phase to reduce work done by the reducer
 Combine (Optional)
 A mini-reducer that summarizes the mapper output records for a single key
 Reduces data transfer between mapper and reducer
 Decreases the amount of data to be processed by the reducer
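A minimal sketch of the effect of a combiner, assuming a word-count style job in plain Python. The two input splits and all function names are hypothetical; the point is only that pre-summing on the mapper side shrinks what is sent to the reducer without changing the final totals.

```python
from collections import Counter

def mapper(lines):
    """Emit one (word, 1) pair per word, as a plain mapper would."""
    return [(word, 1) for line in lines for word in line.split()]

def combiner(pairs):
    """Mini-reducer: pre-sum the counts per key on the mapper's side."""
    partial = Counter()
    for key, count in pairs:
        partial[key] += count
    return sorted(partial.items())

def reducer(all_pairs):
    """Final reduce: sum whatever counts arrive for each key."""
    totals = Counter()
    for key, count in all_pairs:
        totals[key] += count
    return dict(totals)

# Two hypothetical input splits, one per mapper.
split_a = ["taxi taxi cab", "taxi cab"]
split_b = ["cab taxi", "taxi"]

raw = [mapper(split_a), mapper(split_b)]
combined = [combiner(pairs) for pairs in raw]

# Number of records that would cross the network in each case.
sent_without_combiner = sum(len(pairs) for pairs in raw)
sent_with_combiner = sum(len(pairs) for pairs in combined)

totals_raw = reducer(pair for pairs in raw for pair in pairs)
totals_combined = reducer(pair for pairs in combined for pair in pairs)
print(sent_without_combiner, sent_with_combiner, totals_combined)
```

A combiner is only safe when the reduce operation is associative and commutative (sums, counts, maxima); an average, for example, would need to carry a sum and a count instead.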
MAP REDUCE
Image taken from : Data Flair Training Blog
QUESTION : MAP REDUCE
For each day of the month (1 to 31), compute the total number of passengers
across all eligible trips on that day for every month of the year 2014, and
print the maximum of these monthly totals. An eligible trip is a Yellow Taxi
trip with a pickup time between 09:00 am (inclusive) and 10:00 am (exclusive),
a trip distance of less than 3 miles, and a tip paid. Print the day of the
month as a number between 1 and 31 (respecting the number of days in each
month of 2014) followed by that maximum passenger total. Any trip data that
did not have a pickup date between 1st January 2014 and 31st December 2014
should be ignored. The output need not be sorted by day of the month.
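One reading of the question can be sketched in plain Python as a filter in the mapper followed by two aggregation steps. The field names (`pickup_datetime`, `passenger_count`, `trip_distance`, `tip_amount`) and the sample rows are assumptions for illustration; the rows are made up and are not taken from the dataset, and this is not the code used in the presentation.

```python
from collections import defaultdict
from datetime import datetime

def mapper(trip):
    """Emit ((month, day), passengers) for an eligible trip, else nothing."""
    pickup = datetime.strptime(trip["pickup_datetime"], "%Y-%m-%d %H:%M:%S")
    eligible = (
        pickup.year == 2014
        and pickup.hour == 9            # 09:00 inclusive to 10:00 exclusive
        and trip["trip_distance"] < 3.0
        and trip["tip_amount"] > 0
    )
    return [((pickup.month, pickup.day), trip["passenger_count"])] if eligible else []

def max_monthly_total_per_day(trips):
    # Step 1: total passengers per (month, day).
    totals = defaultdict(int)
    for trip in trips:
        for key, passengers in mapper(trip):
            totals[key] += passengers
    # Step 2: for each day of the month, keep the largest monthly total.
    best = {}
    for (month, day), total in totals.items():
        best[day] = max(best.get(day, 0), total)
    return best

def row(pickup, passengers, distance, tip):
    """Build one hypothetical trip record."""
    return {"pickup_datetime": pickup, "passenger_count": passengers,
            "trip_distance": distance, "tip_amount": tip}

# Made-up rows for illustration only.
sample_trips = [
    row("2014-01-05 09:15:00", 2, 1.2, 1.0),
    row("2014-01-05 09:59:59", 1, 2.9, 0.5),
    row("2014-02-05 09:30:00", 4, 0.8, 2.0),
    row("2014-02-05 10:00:00", 6, 0.8, 2.0),   # excluded: 10:00 is too late
    row("2014-03-07 09:10:00", 3, 3.0, 2.0),   # excluded: not under 3 miles
    row("2013-12-31 09:10:00", 5, 1.0, 2.0),   # excluded: not in 2014
    row("2014-03-07 09:45:00", 1, 1.0, 0.0),   # excluded: no tip
]

for day, total in max_monthly_total_per_day(sample_trips).items():
    print(f"{day}:{total}")
```

In a real MapReduce job the first step would be the reducer keyed on (month, day), and the second step a follow-up job or an in-reducer maximum keyed on the day alone.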
ANSWER : MAP REDUCE
Day:CountOfPassengers
1:25
10:25
11:37
12:53
13:25
14:21
15:23
16:27
17:30
18:45
19:38
2:33
20:49
21:39
22:38
23:33
24:38
25:33
26:38
27:44
28:39
29:36
3:32
30:26
31:21
4:34
5:30
6:31
7:36
8:25
9:27
TOP HADOOP VENDORS
Amazon Elastic Map Reduce (EMR)
Cloudera* CDH Hadoop Distribution
Hortonworks* Data Platform (HDP)
MapR Hadoop Distribution
IBM Open Platform
Microsoft Azure HDInsight
Pivotal Big Data Suite
*Merged
PIG
Grew out of Yahoo
A platform for analyzing large data sets
Pig Latin – A procedural language
 Provides a sequence of data transformations
• To merge, filter, apply functions, group records
• Supports User Defined Functions (UDF) for special processing
Programs are compiled into map reduce jobs
 Support for Python, Java, Groovy, JavaScript, Ruby
PIG
Abstraction for map reduce programming
 Improves developer productivity
 Suitable for use by data analysts
Lower performance than map reduce
 Use additional machines in cluster to increase performance
Used to perform tasks for
 Data Storage
 Data Execution
 Data Manipulation
QUESTION : PIG
For all data available for the year 2014 (consider all months), which drop-
off location had the maximum total amount collected by credit card for a
trip exceeding 1 mile where no toll was paid, a tip was also paid but a
standard rate was applied for yellow taxi rides? Any trip data that did not
have a drop-off date between 1st January 2014 and 31st December 2014,
or does not have a valid month or does not have a valid day of the month
should be ignored. Print the drop off location ID (IDS) and the
aggregated total amount for the top location.
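The shape of such a Pig script (FILTER, then GROUP, then SUM, then ORDER and LIMIT) can be mirrored in plain Python. This is a sketch under stated assumptions, not the script from the presentation: the field names, the `"CRD"` code for credit card, and `rate_code == 1` for the standard rate are assumed encodings, and the sample rows are made up.

```python
from collections import defaultdict
from datetime import datetime

def in_2014(text):
    """True if text parses as a valid date-time in 2014 (bad months or days fail to parse)."""
    try:
        return datetime.strptime(text, "%Y-%m-%d %H:%M:%S").year == 2014
    except ValueError:
        return False

def top_dropoff(trips):
    # FILTER: keep only eligible trips.
    eligible = [
        t for t in trips
        if in_2014(t["dropoff_datetime"])
        and t["payment_type"] == "CRD"      # assumed code for credit card
        and t["rate_code"] == 1             # assumed code for standard rate
        and t["trip_distance"] > 1.0
        and t["tolls_amount"] == 0
        and t["tip_amount"] > 0
    ]
    # GROUP BY drop-off location, then SUM(total_amount).
    sums = defaultdict(float)
    for t in eligible:
        sums[(t["dropoff_latitude"], t["dropoff_longitude"])] += t["total_amount"]
    # ORDER BY sum DESC, LIMIT 1.
    return max(sums.items(), key=lambda item: item[1]) if sums else None

def make_trip(dropoff, lat, lon, total, **overrides):
    """Build one hypothetical trip record that is eligible unless overridden."""
    record = {
        "dropoff_datetime": dropoff, "dropoff_latitude": lat, "dropoff_longitude": lon,
        "total_amount": total, "payment_type": "CRD", "rate_code": 1,
        "trip_distance": 2.0, "tolls_amount": 0.0, "tip_amount": 1.0,
    }
    record.update(overrides)
    return record

# Made-up rows for illustration only.
sample_trips = [
    make_trip("2014-03-01 10:00:00", 40.75, -73.99, 12.5),
    make_trip("2014-04-02 11:00:00", 40.75, -73.99, 10.0),
    make_trip("2014-04-02 11:00:00", 40.70, -74.01, 20.0),
    make_trip("2014-02-30 11:00:00", 40.70, -74.01, 50.0),                      # invalid day
    make_trip("2014-05-02 11:00:00", 40.70, -74.01, 50.0, payment_type="CSH"),  # not credit card
]

top = top_dropoff(sample_trips)
print(top)
```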
ANSWER : PIG
Drop-Off Latitude | Drop-Off Longitude | Sum Total Amount
40.78508 | -73.95587 | $65221.65
HIVE
Developed at Facebook
A SQL engine with its own metastore, running over data in HDFS
 Can be queried through HQL (Hive Query Language)
Provides a traditional data warehouse interface
Hive compiler
 Converts hive queries to map reduce programs
 Executed in parallel across machines in the Hadoop cluster
HIVE
 Abstraction for map reduce programming
 Improves developer productivity
 Suitable for individuals with a SQL background
 Lower performance than map reduce
 Use additional machines in cluster to increase performance
 Supports User Defined Functions (UDFs)
 Used for processing structured data
 Data is loaded in tables
 Unstructured data needs to be structured
 Data is then loaded to tables
QUESTION : HIVE
Which three pairs of pickup location / drop off location had the largest ratio of
total amount paid per passenger for trips taken by a yellow taxi for all data
available for the year 2014? Only trips that utilized a payment type of credit
card and utilized a standard rate code should be considered. Any trip data that
did not have a drop-off date between 1st January 2014 and 31st December
2014, or does not have a valid month or does not have a valid day of the
month should be ignored. Print the rank, pickup location, drop off location and
the ratio of total amount paid to the passenger count for these three top pairs
of pickup / drop off locations. Locations should be printed in descending order
of the ratio of total amount paid to the passenger count. The pickup location
and drop off location should be printed as a string of the form
"latitude:longitude" based on the latitude and longitude of the pick-up location
or drop off location. A dense ranking should be performed.
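The aggregation and dense ranking asked for here can be sketched in plain Python. It assumes the trips have already been filtered (credit card, standard rate, valid 2014 drop-off date) and that the ratio is the sum of total amounts divided by the sum of passenger counts for each pair; both are assumptions, as are the function name and the made-up sample rows.

```python
from collections import defaultdict

def top_pairs_by_ratio(trips, top_n=3):
    """Return (rank, pickup, dropoff, ratio) rows whose dense rank is at most top_n."""
    totals = defaultdict(lambda: [0.0, 0])   # pair -> [sum of total_amount, sum of passengers]
    for pickup, dropoff, total_amount, passengers in trips:
        totals[(pickup, dropoff)][0] += total_amount
        totals[(pickup, dropoff)][1] += passengers

    # Ratio per pair, largest first; pairs with no passengers are skipped.
    ratios = sorted(
        ((amount / count, pair) for pair, (amount, count) in totals.items() if count > 0),
        key=lambda item: -item[0],
    )

    ranked, rank, previous = [], 0, None
    for ratio, (pickup, dropoff) in ratios:
        if ratio != previous:            # dense rank: ties share a rank, no gaps follow
            rank, previous = rank + 1, ratio
        if rank > top_n:
            break
        ranked.append((rank, pickup, dropoff, ratio))
    return ranked

# Made-up, pre-filtered rows: (pickup "lat:lon", drop-off "lat:lon", total_amount, passengers).
sample_trips = [
    ("40.1:-73.1", "40.2:-73.2", 30.0, 1),
    ("40.1:-73.1", "40.2:-73.2", 10.0, 1),
    ("40.3:-73.3", "40.4:-73.4", 50.0, 2),
    ("40.5:-73.5", "40.6:-73.6", 25.0, 1),
    ("40.7:-73.7", "40.8:-73.8", 9.0, 3),
    ("40.9:-73.9", "41.0:-74.0", 4.0, 2),
]

ranking = top_pairs_by_ratio(sample_trips)
for entry in ranking:
    print(entry)
```

With dense ranking, two pairs tied at rank 1 are followed by rank 2 rather than rank 3, so more than three rows can be returned for the "top three".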
ANSWER : HIVE
RANK | Pickup Latitude | Pickup Longitude | Drop-Off Latitude | Drop-Off Longitude | Ratio of Total Amount to Total Count
1 | 40.72941 | -73.98386 | 41.30529 | -72.92268 | $401.5
2 | 40.73249 | -73.98791 | 40.72129 | -73.95615 | $354.25
3 | 40.67019 | -73.91853 | 40.87084 | -73.90391 | $354.0
COMPARISON – MAP REDUCE, PIG, HIVE
MAP REDUCE | PIG | HIVE
Compiled language | Scripting language | Query language
Lower level of abstraction | Higher level of abstraction | Higher level of abstraction
Higher learning curve | Lower learning curve | Lowest learning curve
Best performance for very large data | Intermediate performance for very large data (50% lower) | Lowest performance for very large data
Programmer writes the most lines of code | Programmer writes an intermediate number of lines of code | Programmer writes the fewest lines of code
Highest code efficiency (more flexibility) | Relatively lower code efficiency (less flexibility) | Relatively lower code efficiency (less flexibility)
Possible to handle unstructured data | Not very friendly with unstructured data like images | Not very friendly with unstructured data like images
Possible to deal with poor schema design of XML, JSON | Cannot deal with poor design of XML, JSON | Not easy to deal with poor design of XML, JSON
More potential for introducing defects due to having to write very custom code | Limited possibility of introducing defects due to fixed syntactic possibilities | Limited possibility of introducing defects due to fixed syntactic possibilities
SPARK
 Developed at UC Berkeley AMPLab
 An open source big data framework
 Utilizes DAG (Directed Acyclic Graph) programming style
 Now maintained by non-profit Apache Software Foundation
 A unified analytics engine for
 Large scale processing
 Faster, general purpose processing
 Reduces read/write operations from/to disk
 Intermediate data stored in memory to achieve speed
 RDDs (Resilient Distributed Datasets)
 DataFrame
 Used to build batch, iterative, interactive, graph and streaming applications
SPARK
Supports cross-platform development
Programming in Scala, Java, Python, R, SQL – Core APIs
 PySpark (Python)
 SparkR
 Spark SQL (fka Shark)
Rest of the eco-system
 MLlib (Machine Learning)
 GraphX (Graph Computation)
 Spark Streaming
COMPARISON – SPARK, MAP REDUCE
CRITERIA | SPARK | MAP REDUCE
Written In | Scala | Java
License | Apache 2 | Apache 2
OS support | Cross-platform | Cross-platform
Programming Languages | Scala, Java, Python, R, SQL | Java, C, C++, Ruby, Groovy, Python, Perl
Lines of Code (LOC) | Approximately 20,000 | Approximately 120,000
Hardware Requirements | Requires the use of mid to high level hardware | Runs well on commodity hardware
Data Storage | Hadoop Distributed File System (HDFS), Google Cloud Storage, Amazon S3, Microsoft Azure | Hadoop Distributed File System (HDFS), MapR, HBase
Community | Strong community, one of the most active projects at Apache | MapReduce community has shifted to Spark
Scalability | Highly scalable, one of the largest clusters has 8K nodes | Even higher scalability, one of the largest clusters has 14K nodes
COMPARISON – SPARK, MAP REDUCE
CRITERIA | SPARK | MAP REDUCE
Speed | Up to 100x faster in memory, up to 10x faster on disk | Faster than traditional approaches
Difficulty / Ease of use | Easy to program with the use of high level operators (RDDs and data frames) | Difficult due to the need to program each and every operation
Ease of management | Easy since it is a single analytics engine that performs various tasks | It is a batch engine and needs to be coupled with other engines like Storm, Giraph, Impala etc. to achieve various tasks
Fault tolerance | No need to start from scratch (except for programming errors) but some limitations due to in-memory operations | No need to start from scratch (except for programming errors)
Data Processing modes | Batch, Real Time, Iterative, Interactive, Graph, Streaming | Batch
APIs and caching | Caches data in memory | No support for caching
SQL Support | Supported via Spark SQL (fka Shark) | Supported via Hive
COMPARISON – SPARK, MAP REDUCE
CRITERIA | SPARK | MAP REDUCE
Real Time analysis | Possible to handle at scale | No support for real-time analysis
Streaming | Spark Streaming handles streaming | No support for streaming
Interactive mode | Supported | Not supported
Recovery | Allows recovery of failed nodes by re-computation of the DAG | Resilient to system faults or failures. It is a highly fault-tolerant system.
Latency | Low | High
Scheduler | Due to in-memory computations it acts as its own flow-scheduler | Requires an external job scheduler like Oozie for its flows
Security / Access Permission | Less secure since the only mechanism supported is shared secret authentication | More secure because of Kerberos and ACLs (access control lists)
Cost | Requires plenty of RAM for in-memory computations, so costs increase as cluster size increases | It is cheaper in terms of cost
Category Choice | Choice of data scientists since it is a complete analytics engine | Choice of data engineers since it is a basic data processing engine
QUESTION : SPARK
Which day (or days) across all months in the year 2014 yielded the largest total
tip amount (across all eligible trips) as a percentage of the total amount (across
all eligible trips) for trips that charged the standard rate on a Yellow Taxi where
the total amount for each trip exceeded 5 and no toll was paid? Print the day
(or days) of the month (only the day ranging from 1 to 31) in 2014 and the
total tip amount as a percentage of the total amount. Utilize the pickup date
time for deciding which day of the month that the trip counts against. The drop
off datetime need not be considered. Any trip data that did not have a pickup
date between 1st January 2014 and 31st December 2014, or does not have a
valid month or does not have a valid day of the month should be ignored.
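The computation behind this question can be sketched in plain Python; a Spark job would express the same filter, group and aggregate steps over an RDD or DataFrame. The field names, `rate_code == 1` as the standard rate, and the sample rows are assumptions for illustration, and the rows are made up rather than taken from the dataset.

```python
from collections import defaultdict
from datetime import datetime

def best_tip_days(trips):
    """Return [(day_of_month, tip_percentage)] for the day or days with the largest percentage."""
    sums = defaultdict(lambda: [0.0, 0.0])   # day -> [sum of tips, sum of total amounts]
    for t in trips:
        try:
            pickup = datetime.strptime(t["pickup_datetime"], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue                          # invalid month or day: ignore the trip
        if pickup.year != 2014 or t["rate_code"] != 1:
            continue
        if t["total_amount"] <= 5 or t["tolls_amount"] != 0:
            continue
        sums[pickup.day][0] += t["tip_amount"]
        sums[pickup.day][1] += t["total_amount"]

    percentages = {day: 100.0 * tips / total for day, (tips, total) in sums.items()}
    if not percentages:
        return []
    best = max(percentages.values())
    return sorted((day, pct) for day, pct in percentages.items() if pct == best)

def make_trip(pickup, tip, total, rate_code=1, tolls=0.0):
    """Build one hypothetical trip record."""
    return {"pickup_datetime": pickup, "tip_amount": tip, "total_amount": total,
            "rate_code": rate_code, "tolls_amount": tolls}

# Made-up rows for illustration only.
sample_trips = [
    make_trip("2014-01-10 08:00:00", 2.0, 20.0),
    make_trip("2014-02-10 18:30:00", 3.0, 20.0),
    make_trip("2014-03-11 12:00:00", 1.0, 10.0),
    make_trip("2014-03-12 12:00:00", 1.0, 5.0),               # excluded: total does not exceed 5
    make_trip("2014-03-13 12:00:00", 9.0, 30.0, tolls=5.33),  # excluded: toll paid
    make_trip("2014-04-31 12:00:00", 9.0, 30.0),              # excluded: invalid day
]

result = best_tip_days(sample_trips)
print(result)
```

Note that the percentage is computed from the summed tips and summed totals per day, not as an average of per-trip percentages; the two generally differ.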
ANSWER : SPARK
PICK UP DAY OF THE MONTH | PERCENTAGE OF SUM TIP AMOUNT TO SUM TOTAL AMOUNT
10 | 9.991076
ALTERNATIVES
TECHNIQUE | ALTERNATIVE TECHNIQUE
Map Reduce | Apache Spark
Pig | Apache Spark
Hive | Apache Spark, Impala, HAWQ, Spark SQL, Shark, Pivotal HDB (fka HAWQ), PrestoDB (Facebook), BigSQL (IBM), BigQuery (Google)
Spark | Apache Storm, Flume, Cassandra, Amazon Kinesis, Splunk, Elasticsearch, Koalas (Databricks), Vaex (Python library for lazy out-of-core data frames)
REFERENCES
What is Big Data?
Data Center storage capacity worldwide from 2016 to 2021, by segment
How big is a Yottabyte?
What is High Performance Computing?
The 7 V’s of Big Data
Distributed Computing
Hadoop Combiner – Best Explanation to MapReduce Combiner
Pig Documentation
UC Berkeley AMPLab
NYC TLC Trip Record Data
Map Reduce vs Pig vs Hive
Spark vs Hadoop MapReduce: Which big data framework to choose
Apache Spark vs Hadoop MapReduce – Feature Wise Comparison
Spark vs Hadoop MapReduce
MapReduce vs Spark – 20 Useful Comparisons To Learn
Spark vs Hadoop : Which is the Best Big Data Framework

Experts live - Improving user adoption with AI
 
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdfUdemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
 
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
 
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
 
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
 
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
 
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
 
A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
 
DSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelinesDSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelines
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
 

High Performance Computing on NYC Yellow Taxi Data Set

  • 1. EXTRACTING INSIGHTS FROM BIG DATA: A CASE OF NEW YORK CITY YELLOW TAXI DATASET Parag Ahire January 11, 2020
  • 2. PRESENTATION OUTLINE Brief introduction • Big Data & high performance computing • Describe few techniques for high performance computing  Compare and Contrast few techniques Sample Dataset introduction – NYC Yellow Taxi Apply four techniques • High-level code review • Demonstration on Hortonworks virtual machine on Azure
  • 3. BIG DATA  Data is growing • Digital Age : 2002 onwards • 2019 – 1770 Exabytes • 2020 – 2000 Exabytes  What is it ? • A data set that cannot be processed by a “normal” machine in a “reasonable” amount of time • Three V’s  Volume  Velocity  Variety • May vary by time and prevalent technology  Used to be Giga/Tera bytes  Now Exa/Peta bytes  Future Zetta/Yotta bytes o Zetta byte – 1000 data centers occupying 20% of Manhattan o Yotta byte – 1M data centers occupying Delaware and Rhode Island
  • 4. HIGH PERFORMANCE COMPUTING  The ability to process massive amount of data and perform complex calculations at high speed  New Challenges (7 V’s)  Previously - Volume  Now – Velocity, Variety, Variability, Veracity, Visualization, Value  How to perform ? • Supercomputers – expensive, require specialized expertise to use and solve specialized problems • Cluster of small or medium sized business computers • Modern “supercomputers” are mostly “cluster of computers”
  • 5. PARALLEL AND DISTRIBUTED COMPUTING Parallel Computing – All processors have access to shared memory Distributed Computing – Each processor has its own memory. Information is exchanged by passing messages between processors Images taken from : Wikipedia
  • 6. DISTRIBUTED COMPUTING MODELS  Parallel algorithms  Shared-memory model • All processors access shared memory • Programmer decides what program is executed by each processor  Message-passing model • Programmer chooses o Network structure o Program executed by each computer  Distributed algorithms  Programmer chooses the computer program  All computers run the same program
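The contrast between the shared-memory and message-passing models above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: threads inside one process stand in for processors, and the function names `shared_memory_sum` and `message_passing_sum` are hypothetical, not part of the presentation.

```python
import queue
import threading


def shared_memory_sum(chunks):
    """Shared-memory style: every worker updates one variable, guarded by a lock."""
    total = 0
    lock = threading.Lock()

    def worker(chunk):
        nonlocal total
        partial = sum(chunk)
        with lock:  # all workers touch the same memory location
            total += partial

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


def message_passing_sum(chunks):
    """Message-passing style: workers keep private state and send results as messages."""
    mailbox = queue.Queue()

    def worker(chunk):
        mailbox.put(sum(chunk))  # the only interaction is a message

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(mailbox.get() for _ in chunks)


chunks = [[1, 2, 3], [4, 5], [6]]
print(shared_memory_sum(chunks), message_passing_sum(chunks))
```

In a real cluster the message-passing workers would run on separate machines and the queue would be a network channel; the sketch only shows where the coordination happens in each model.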
  • 7. HIGH PERFORMANCE COMPUTING TECHNIQUES (HPCT)  Map Reduce  A framework or programming model  Suitable for processing large volumes of structured/unstructured data  Pig  Procedural rather than declarative coding approach  Provides a high degree of abstraction for map reduce  Hive  A traditional data warehouse interface for map reduce  Spark  An open source big data framework  A unified analytics engine for large-scale data processing
  • 8. MAP REDUCE  Map function  Input – A Key Value pair • (k1, v1) -> list(k2, v2)  Output – A list of key value pairs (one or more elements)  Reduce function  Input – A Key and a list of values • (k2, list(v2)) -> list(v2)  Sort  Merging and sorting of output produced in the map phase  Shuffle  Transfers intermediate output of map phase to reducer  Passes on intermediate output of one or more keys to a single reducer
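The map, shuffle/sort and reduce steps described above can be imitated on a single machine. The following is a minimal sketch, assuming a word-count job; the helper names (`map_fn`, `shuffle`, `reduce_fn`, `run_job`) are illustrative and are not Hadoop APIs.

```python
from collections import defaultdict


def map_fn(key, value):
    """(k1, v1) -> list((k2, v2)): here k1 is a line number and v1 a line of text."""
    return [(word.lower(), 1) for word in value.split()]


def shuffle(pairs):
    """Group intermediate values by key and sort the keys, as the framework would."""
    groups = defaultdict(list)
    for k2, v2 in pairs:
        groups[k2].append(v2)
    return dict(sorted(groups.items()))


def reduce_fn(key, values):
    """(k2, list(v2)) -> list(v2): collapse all values for one key."""
    return [sum(values)]


def run_job(records):
    mapped = [pair for k1, v1 in records for pair in map_fn(k1, v1)]
    grouped = shuffle(mapped)
    return {k2: reduce_fn(k2, values) for k2, values in grouped.items()}


records = [(0, "yellow taxi"), (1, "green taxi")]
result = run_job(records)
print(result)
```

On a cluster, `map_fn` runs in parallel over input splits and the shuffle moves data across the network; here everything runs in one process purely to show the data flow.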
  • 9. MAP REDUCE  Concerns  Map phase – done in parallel, typically 20% of the work  Reduce phase – executed sequentially for each key, typically 80% of the work  Tips  Increase the work done in the map phase and leave less for the reduce phase  Include the optional combine phase to reduce work done by the reducer  Combine (Optional)  A mini-reducer that summarizes mapper output records for a single key  Reduces data transfer between mapper and reducer  Decreases the amount of data to be processed by the reducer
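The effect of the optional combine phase can be shown by counting how many records leave the mappers with and without it. This is a sketch under assumptions: two hypothetical input partitions and illustrative helper names (`map_partition`, `combine`, `reduce_all`), none of which come from the slides.

```python
from collections import Counter


def map_partition(lines):
    """Mapper: emit (word, 1) for every word in this partition."""
    return [(word, 1) for line in lines for word in line.split()]


def combine(pairs):
    """Combiner: pre-aggregate one mapper's output per key before the shuffle."""
    local = Counter()
    for key, value in pairs:
        local[key] += value
    return list(local.items())


def reduce_all(mapper_outputs):
    """Reducer: sum the (possibly pre-aggregated) values for each key."""
    totals = Counter()
    for output in mapper_outputs:
        for key, value in output:
            totals[key] += value
    return dict(totals)


partitions = [["taxi taxi taxi", "taxi cab"], ["cab cab taxi"]]
raw = [map_partition(p) for p in partitions]
combined = [combine(m) for m in raw]

records_without_combiner = sum(len(m) for m in raw)
records_with_combiner = sum(len(m) for m in combined)
print(records_without_combiner, records_with_combiner)
```

Both paths give the same final counts; the combiner only shrinks what has to be shuffled. This works because addition is associative and commutative, which is the usual condition for reusing reducer logic as a combiner.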
  • 10. MAP REDUCE Image taken from : Data Flair Training Blog
  • 12. QUESTION : MAP REDUCE For each day of the month, across all months of the year 2014, print the maximum monthly total number of passengers (summed over all eligible trips) who boarded a Yellow Taxi between 09:00 am (inclusive) and 10:00 am (exclusive) for a trip distance of less than 3 miles where a tip was paid. Print the day of the month as a number and, for that day, the largest passenger total observed in any single month. The day of the month should be represented as a number between 1 and 31, taking into account the number of days in each month of the year 2014. Any trip that did not have a pickup date between 1st January 2014 and 31st December 2014 should be ignored. The output need not be sorted by day of the month.
  • 13. ANSWER : MAP REDUCE Day:CountOfPassengers 1:25 10:25 11:37 12:53 13:25 14:21 15:23 16:27 17:30 18:45 19:38 2:33 20:49 21:39 22:38 23:33 24:38 25:33 26:38 27:44 28:39 29:36 3:32 30:26 31:21 4:34 5:30 6:31 7:36 8:25 9:27
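One way to read the question above is as two chained map-reduce stages: first sum passengers per (month, day), then take the maximum per day across months. The sketch below follows that reading in plain Python. It is not the presenter's code, and the record field names (`pickup_datetime`, `passenger_count`, `trip_distance`, `tip_amount`) are assumptions about how a trip row might be represented.

```python
from collections import defaultdict
from datetime import datetime


def map_trip(trip):
    """Emit ((month, day), passengers) for an eligible trip, otherwise nothing."""
    try:
        pickup = datetime.strptime(trip["pickup_datetime"], "%Y-%m-%d %H:%M:%S")
    except (KeyError, ValueError):
        return []  # missing or invalid pickup date: ignore the trip
    eligible = (
        pickup.year == 2014
        and pickup.hour == 9  # 09:00 inclusive to 10:00 exclusive
        and trip["trip_distance"] < 3
        and trip["tip_amount"] > 0
    )
    return [((pickup.month, pickup.day), trip["passenger_count"])] if eligible else []


def max_passengers_per_day(trips):
    # Stage 1: reduce by (month, day) with a sum.
    per_month_day = defaultdict(int)
    for trip in trips:
        for key, passengers in map_trip(trip):
            per_month_day[key] += passengers
    # Stage 2: re-key by day of month and reduce with a max across months.
    per_day = defaultdict(int)
    for (_month, day), total in per_month_day.items():
        per_day[day] = max(per_day[day], total)
    return dict(per_day)
```

Because both sum and max are associative, either stage could also use a combiner to shrink the data before the shuffle.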
  • 14. TOP HADOOP VENDORS Amazon Elastic Map Reduce (EMR) Cloudera* CDH Hadoop Distribution Hortonworks* Data Platform (HDP) MapR Hadoop Distribution IBM Open Platform Microsoft Azure HDInsight Pivotal Big Data Suite *Merged
  • 15. PIG Grew out of Yahoo A platform for analyzing large data sets Pig Latin – A procedural language  Provides a sequence of data transformations • To merge, filter, apply functions, group records • Supports User Defined Functions (UDF) for special processing Programs are compiled into map reduce jobs  Support for python, java, groovy, JavaScript, ruby
  • 16. PIG Abstraction for map reduce programming  Improves developer productivity  Suitable for use for data analysts Lower performance than map reduce  Use additional machines in cluster to increase performance Used to perform tasks for  Data Storage  Data Execution  Data Manipulation
  • 17. QUESTION : PIG For all data available for the year 2014 (consider all months), which drop-off location had the maximum total amount collected by credit card for a trip exceeding 1 mile where no toll was paid, a tip was also paid but a standard rate was applied for yellow taxi rides? Any trip data that did not have a drop-off date between 1st January 2014 and 31st December 2014, or does not have a valid month or does not have a valid day of the month should be ignored. Print the drop off location ID (IDS) and the aggregated total amount for the top location.
  • 18. ANSWER : PIG Drop-Off Latitude: 40.78508 | Drop-Off Longitude: -73.95587 | Sum Total Amount: $65221.65
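The Pig question above is essentially a filter, a group, a sum and a top-1 selection, which is the sequence of transformations a Pig Latin script would express. The Python sketch below mirrors those steps. It is not the presenter's script: the field names, the `"CRD"` credit-card code and the value `1` for the standard rate are assumptions about the dataset encoding.

```python
from collections import defaultdict
from datetime import datetime


def is_eligible(trip):
    """FILTER step: keep 2014 credit-card, standard-rate, tipped, toll-free trips over 1 mile."""
    try:
        dropoff = datetime.strptime(trip["dropoff_datetime"], "%Y-%m-%d %H:%M:%S")
    except (KeyError, ValueError):
        return False  # invalid month or day of the month: ignore the trip
    return (
        dropoff.year == 2014
        and trip["payment_type"] == "CRD"  # assumed code for credit card
        and trip["rate_code"] == 1  # assumed code for the standard rate
        and trip["trip_distance"] > 1
        and trip["tolls_amount"] == 0
        and trip["tip_amount"] > 0
    )


def top_dropoff_location(trips):
    """GROUP by drop-off location, SUM total_amount, then keep the largest group."""
    totals = defaultdict(float)
    for trip in trips:
        if is_eligible(trip):
            location = (trip["dropoff_latitude"], trip["dropoff_longitude"])
            totals[location] += trip["total_amount"]
    if not totals:
        return None
    return max(totals.items(), key=lambda item: item[1])
```

The function returns a `((latitude, longitude), total)` pair, matching the shape of the answer on the slide.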
  • 19. HIVE Developed at Facebook A SQL engine on its own meta store on HDFS  Can be queried though HQL (Hive Query Language) Provides a traditional data ware house interface Hive compiler  Converts hive queries to map reduce programs  Executed in parallel across machines in the Hadoop cluster
  • 20. HIVE  Abstraction for map reduce programming  Improves developer productivity  Suitable for individuals with a SQL background  Lower performance than map reduce  Use additional machines in cluster to increase performance  Supports User Defined Functions (UDF’s)  Used for processing structured data  Data is loaded in tables  Unstructured data needs to be structured  Data is then loaded to tables
  • 21. QUESTION : HIVE Which three pairs of pickup location / drop off location had the largest ratio of total amount paid per passenger for trips taken by a yellow taxi for all data available for the year 2014? Only trips that utilized a payment type of credit card and utilized a standard rate code should be considered. Any trip data that did not have a drop-off date between 1st January 2014 and 31st December 2014, or does not have a valid month or does not have a valid day of the month should be ignored. Print the rank, pickup location, drop off location and the ratio of total amount paid to the passenger count for these three top pairs of pickup / drop off locations. Locations should be printed in descending order of the ratio of total amount paid to the passenger count. The pickup location and drop off location should be printed as a string of the form "latitude:longitude" based on the latitude and longitude of the pick-up location or drop off location. A dense ranking should be performed.
  • 22. ANSWER : HIVE
      RANK | Pickup Latitude | Pickup Longitude | Drop-Off Latitude | Drop-Off Longitude | Ratio of Total Amount to Total Count
      1 | 40.72941 | -73.98386 | 41.30529 | -72.92268 | $401.5
      2 | 40.73249 | -73.98791 | 40.72129 | -73.95615 | $354.25
      3 | 40.67019 | -73.91853 | 40.87084 | -73.90391 | $354.0
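The Hive question above asks for a dense ranking, where tied values share a rank and the next distinct value gets the next consecutive rank. The sketch below shows that behaviour together with the per-pair ratio in plain Python. It is an illustration only: the field names and the `"latitude:longitude"` strings in the usage are hypothetical, and filtering by payment type, rate code and date is assumed to have happened already.

```python
from collections import defaultdict


def dense_rank(rows, key):
    """Rank rows in descending order of key(row); ties share a rank, no gaps follow."""
    ranked = []
    rank = 0
    previous = object()  # sentinel that never equals a real value
    for row in sorted(rows, key=key, reverse=True):
        value = key(row)
        if value != previous:
            rank += 1
            previous = value
        ranked.append((rank, row))
    return ranked


def ratio_per_pair(trips):
    """Total amount divided by total passenger count for each (pickup, dropoff) pair."""
    sums = defaultdict(lambda: [0.0, 0])
    for trip in trips:
        if trip["passenger_count"] > 0:  # avoid dividing by zero passengers
            pair = (trip["pickup"], trip["dropoff"])
            sums[pair][0] += trip["total_amount"]
            sums[pair][1] += trip["passenger_count"]
    return {pair: amount / count for pair, (amount, count) in sums.items()}


def top_pairs(trips, limit=3):
    """Return (rank, pair, ratio) for every pair whose dense rank is within the limit."""
    ranked = dense_rank(ratio_per_pair(trips).items(), key=lambda item: item[1])
    return [(rank, pair, ratio) for rank, (pair, ratio) in ranked if rank <= limit]
```

Note that with dense ranking a limit of three ranks can return more than three rows when ratios tie.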
  • 23. COMPARISON – MAP REDUCE, PIG, HIVE
      MAP REDUCE | PIG | HIVE
      Compiled Language | Scripting Language | Query Language
      Lower level of abstraction | Higher level of abstraction | Higher level of abstraction
      Higher learning curve | Lower learning curve | Lowest learning curve
      Best performance for very large data | Intermediate performance for very large data (50 % lower) | Least performance for very large data
      Programmer writes more lines of code | Programmer writes intermediate lines of code | Programmer writes least lines of code
      Highest code efficiency (more flexibility) | Relatively less code efficiency (lesser flexibility) | Relatively less code efficiency (lesser flexibility)
      Possible to handle unstructured data | Not very friendly with unstructured data like images | Not very friendly with unstructured data like images
      Possible to deal with poor schema design of xml, json | Cannot deal with poor design of xml, json | Not easy to deal with poor design of xml, json
      More potential of introducing defects due to having to write very custom code | Limited possibility of introducing defects due to fixed syntactic possibilities | Limited possibility of introducing defects due to fixed syntactic possibilities
  • 24. SPARK  Developed at UC Berkeley AMPLab  An open source big data framework  Utilizes DAG (Directed Acyclic Graph) programming style  Now maintained by non-profit Apache Software Foundation  A unified analytics engine for  Large scale processing  Faster, general purpose processing  Reduces read/write operations from/to disk  Intermediate data stored in memory to achieve speed  RDD’s (Resilient Distributed Dataset)  DataFrame  Used to build batch, iterative, interactive, graph and streaming applications
  • 25. SPARK Supports cross-platform development Programming in Scala, Java, Python, R, SQL – Core API’s  PySpark (Python)  SparkR  Spark SQL (fka Shark) Rest of the eco-system  MLLib (Machine Learning)  GraphX (Graph Computation)  Spark Streaming
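The DAG style mentioned for Spark means that transformations are recorded rather than executed, and work only happens when a result is requested. The toy class below imitates that idea in plain Python. `LazyDataset` is a hypothetical stand-in written for this illustration; it is not the Spark or PySpark API and it handles only a linear chain of steps, not a general graph.

```python
class LazyDataset:
    """Toy lazily evaluated dataset: records its lineage, computes only on collect()."""

    def __init__(self, source, lineage=()):
        self._source = source
        self._lineage = lineage  # tuple of (kind, function) steps

    def map(self, fn):
        # Transformation: returns a new dataset, nothing is computed yet.
        return LazyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, fn):
        # Transformation: returns a new dataset, nothing is computed yet.
        return LazyDataset(self._source, self._lineage + (("filter", fn),))

    def collect(self):
        # Action: replay the recorded lineage over the source data.
        rows = iter(self._source)
        for kind, fn in self._lineage:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return list(rows)


fares = LazyDataset([4.0, 12.5, 7.0, 30.0])
large_with_tax = fares.map(lambda f: f * 2).filter(lambda f: f > 20)
print(large_with_tax.collect())
```

Because each dataset remembers how it was derived, a lost result can be rebuilt by replaying the lineage from the source, which is the idea behind recovering from failures by re-computation.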
  • 26. COMPARISON – SPARK, MAP REDUCE
      CRITERIA | SPARK | MAP REDUCE
      Written In | Scala | Java
      License | Apache 2 | Apache 2
      OS support | Cross-platform | Cross-platform
      Programming Languages | Scala, Java, Python, R, SQL | Java, C, C++, Ruby, Groovy, Python, Perl
      Lines of Code (LOC) | Approximately 20,000 | Approximately 120,000
      Hardware Requirements | Requires the use of mid to high level hardware | Runs well on commodity hardware
      Data Storage | Hadoop Distributed File System (HDFS), Google Cloud Storage, Amazon S3, Microsoft Azure | Hadoop Distributed File System (HDFS), MapR, HBase
      Community | Strong community, one of the most active projects at Apache | MapReduce community has shifted to Spark
      Scalability | Highly scalable, one of the largest clusters has 8K nodes | Even higher scalability, one of the largest clusters has 14K nodes
  • 27. COMPARISON – SPARK, MAP REDUCE
      CRITERIA | SPARK | MAP REDUCE
      Speed | 100x faster in memory, 10x faster on disk | Faster than traditional approaches
      Difficulty / Ease of use | Easy to program with the use of high level operators (RDD’s and data frames) | Difficult due to the need to program each and every operation
      Ease of management | Easy since it is a single analytics engine that performs various tasks | It is a batch engine and needs to be coupled with other engines like Storm, Giraph, Impala etc. to achieve various tasks
      Fault tolerance | No need to start from scratch (except for programming errors) but some limitations due to in memory operations | No need to start from scratch (except for programming errors)
      Data Processing modes | Batch, Real Time, Iterative, Interactive, Graph, Streaming | Batch
      API’s and caching | Caches data in memory | No support for caching
      SQL Support | Support via Spark SQL (fka Shark) | Supported via Hive
  • 28. COMPARISON – SPARK, MAP REDUCE
      CRITERIA | SPARK | MAP REDUCE
      Real Time analysis | Possible to handle at scale | No support for real-time analysis
      Streaming | Spark Streaming handles streaming | No support for streaming
      Interactive mode | Supported | Not supported
      Recovery | Allows recovery of failed nodes by re-computation of DAG | Resilient to system faults or failures. It is a highly tolerant system.
      Latency | Low | High
      Scheduler | Due to in-memory computations it acts as its own flow-scheduler | Requires an external job scheduler like Oozie for its flows
      Security / Access Permission | Less secure since the only mechanism supported is shared secret authentication | More secure because of Kerberos and ACL’s (access control lists)
      Cost | Requires plenty of RAM for in-memory computations, so increases costs as cluster size increases | It is cheaper in terms of cost
      Category Choice | Choice of data scientists since it is a complete analytics engine | Choice of data engineers since it is a basic data processing engine
  • 29. QUESTION : SPARK Which day (or days) across all months in the year 2014 yielded the largest total tip amount (across all eligible trips) as a percentage of the total amount (across all eligible trips) for trips that charged the standard rate on a Yellow Taxi where the total amount for each trip exceeded 5 and no toll was paid? Print the day (or days) of the month (only the day ranging from 1 to 31) in 2014 and the total tip amount as a percentage of the total amount. Utilize the pickup date time for deciding which day of the month that the trip counts against. The drop off datetime need not be considered. Any trip data that did not have a pickup date between 1st January 2014 and 31st December 2014, or does not have a valid month or does not have a valid day of the month should be ignored.
  • 30. ANSWER : SPARK PICK UP DAY OF THE MONTH PERCENTAGE OF SUM TIP AMOUNT TO SUM TOTAL AMOUNT 10 9.991076
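The Spark question above reduces to a filter, a grouped pair of sums and a ratio, followed by selecting the day or days with the highest percentage. The sketch below expresses that logic in plain Python rather than in PySpark, so it shows the computation and not the presenter's job. The field names and the value `1` for the standard rate code are assumptions about the dataset encoding.

```python
from collections import defaultdict
from datetime import datetime


def tip_percentage_by_day(trips):
    """Sum tips and totals per pickup day of month, then return tips as a % of totals."""
    sums = defaultdict(lambda: [0.0, 0.0])  # day -> [tip sum, total sum]
    for trip in trips:
        try:
            pickup = datetime.strptime(trip["pickup_datetime"], "%Y-%m-%d %H:%M:%S")
        except (KeyError, ValueError):
            continue  # invalid month or day of the month: ignore the trip
        if (
            pickup.year == 2014
            and trip["rate_code"] == 1  # assumed code for the standard rate
            and trip["total_amount"] > 5
            and trip["tolls_amount"] == 0
        ):
            sums[pickup.day][0] += trip["tip_amount"]
            sums[pickup.day][1] += trip["total_amount"]
    return {day: 100.0 * tip / total for day, (tip, total) in sums.items()}


def best_days(trips):
    """Return every (day, percentage) that ties for the largest percentage."""
    percentages = tip_percentage_by_day(trips)
    if not percentages:
        return []
    best = max(percentages.values())
    return sorted((day, pct) for day, pct in percentages.items() if pct == best)
```

The percentage is computed from the two sums per day, not by averaging per-trip percentages, which matches the wording "total tip amount as a percentage of the total amount".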
  • 31. ALTERNATIVES
      TECHNIQUE | ALTERNATIVE TECHNIQUE
      Map Reduce | Apache Spark
      Pig | Apache Spark
      Hive | Apache Spark, Impala, HAWQ, Spark SQL, Shark; Pivotal HDB (fka HAWQ); PrestoDB (Facebook); BigSQL (IBM); BigQuery (Google)
      Spark | Apache Storm, Flume; Cassandra; Amazon Kinesis; Splunk; Elasticsearch; Koalas (Databricks); Vaex – Python library for lazy out-of-core data frames
  • 32. REFERENCES What is Big Data?; Data Center storage capacity worldwide from 2016 to 2021, by segment; How big is a Yottabyte?; What is High Performance Computing?; The 7 V’s of Big Data; Distributed Computing; Hadoop Combiner – Best Explanation to MapReduce Combiner; Pig Documentation; UC Berkeley AMPLab; NYC TLC Trip Record Data; Map Reduce vs Pig vs Hive; Spark vs Hadoop MapReduce: Which big data framework to choose; Apache Spark vs Hadoop MapReduce – Feature Wise Comparison; Spark vs Hadoop MapReduce; MapReduce vs Spark – 20 Useful Comparisons To Learn; Spark vs Hadoop : Which is the Best Big Data Framework