SlideShare a Scribd company logo
Citi Bike NYC
By,
Sushmanth Sagala
Spark Project – UpX Academy
Project Information
 Domain: Travel
 Technology use: Spark, Spark SQL, Spark MLlib
 Dataset: https://s3.amazonaws.com/tripdata/index.html
 Citi Bike is the nation's largest bike share program, with 10,000 bikes and 600
stations across Manhattan, Brooklyn, Queens and Jersey City. Analyze this data to
find the interesting patterns in the traffic.
Data Description
 Trip Duration (seconds)
 Start Time and Date
 Stop Time and Date
 Start Station Name
 End Station Name
 Station ID
 Station Lat/Long
 Bike ID
 User Type (Customer = 24-hour pass or 3-day pass user; Subscriber = Annual
Member)
 Gender (Zero=unknown; 1=male; 2=female)
 Year of Birth
Business Questions
 Spark Core / Spark SQL
 Which route Citi Bikers ride the most?
 Find the biggest trip and its duration?
 When do they ride?
 How far do they go?
 Which stations are most popular?
 Which days of the week are most rides taken on?
 Spark MLLIB
 Predicting the gender based on patterns in the trip
Data Loading
 Used com.databricks.spark library to load CSV data.
 val bike = sqlContext.read.format("com.databricks.spark.csv").option("header",
"true").option("inferSchema", "true").option("delimiter",
",").load("/user/cloudera/Project/CitiBikeNYC/* - Citi Bike trip data.csv")
 Parsed to rename the columns and register the Dataframe as a Temporary Table.
 val bikeDF = bike.toDF("trip_duration", "start_time", "stop_time", "start_station_id",
"start_station_name", "start_station_latitude", "start_station_longitude",
"end_station_id", "end_station_name", "end_station_latitude",
"end_station_longitude", "bike_id", "user_type", "birth_year", "gender")
 bikeDF.registerTempTable("bike")
Business Questions Answered
 Q1. Which route Citi Bikers ride the most?
 To determine the route Citi Bikers ride the most.
 Grouped the Start and End station combination to find the aggregate count and fetching
the route with highest rides taken on.
 sqlContext.sql("SELECT start_station_id, start_station_name, end_station_id,
end_station_name, COUNT(1) AS cnt FROM bike GROUP BY start_station_id,
start_station_name, end_station_id, end_station_name ORDER BY cnt DESC LIMIT 1")
 Q2. Find the biggest trip and its duration?
 To Determine the biggest trip, analyzed trip duration and distance of each trip.
 Distance of each trip calculated based on distance between station location (Latitude and
Longitude).
 Defined an UDF function calcDist to calculate the distance between locations.
Business Questions Answered
 def calcDist(lat1: Double, lon1: Double,lat2: Double, lon2: Double): Int = {
val AVERAGE_RADIUS_OF_EARTH_KM = 6371
val latDistance = Math.toRadians(lat1 - lat2)
val lngDistance = Math.toRadians(lon1 - lon2)
val sinLat = Math.sin(latDistance / 2)
val sinLng = Math.sin(lngDistance / 2)
val a = sinLat * sinLat + (Math.cos(Math.toRadians(lat1)) *
Math.cos(Math.toRadians(lat2)) * sinLng * sinLng)
val c = 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a))
(AVERAGE_RADIUS_OF_EARTH_KM * c).toInt }
 sqlContext.udf.register("calcDist", calcDist _)
 Ordered by the trip duration and distance to find the biggest trip.
 sqlContext.sql("SELECT start_station_id, start_station_name, end_station_id,
end_station_name, trip_duration, calcDist(start_station_latitude, start_station_longitude,
end_station_latitude, end_station_longitude) AS dist FROM bike ORDER BY trip_duration
DESC, dist DESC LIMIT 1")
Business Questions Answered
 Q3. When do they ride?
 To determine the top 10 most preferred time for rides.
 Grouped the Start time based on Hours and Minutes to aggregate the number of rides
taken at respective time and fetching the top 10 most preferred Start time by riders.
 sqlContext.sql("SELECT date_format(start_time,'H:m'), COUNT(1) as cnt FROM bike
GROUP BY date_format(start_time,'H:m') ORDER BY cnt DESC LIMIT 10")
 Q4. How far do they go?
 To determine the farthest ride ever taken.
 Calculated the distance between Start and End stations based on the UDF function
calcDist created and fetched the ride with longest distance.
 sqlContext.sql("SELECT start_station_id, start_station_name, end_station_id,
end_station_name, trip_duration, calcDist(start_station_latitude, start_station_longitude,
end_station_latitude, end_station_longitude) AS dist FROM bike ORDER BY dist DESC
LIMIT 1")
Business Questions Answered
 Q5. Which stations are most popular?
 To determine the top 10 most popular stations among the riders.
 To get the count aggregations of each station, used union to get the overall station list
irrespective of Start or End station and fetch the top 10 stations.
 var popularStationsDF = sqlContext.sql("SELECT start_station_id as station_id,
start_station_name as station_name, COUNT(1) as cnt FROM bike GROUP BY start_station_id,
start_station_name UNION ALL SELECT end_station_id as station_id, end_station_name as
station_name, COUNT(1) as cnt FROM bike GROUP BY end_station_id, end_station_name")
 popularStationsDF.registerTempTable("popularStations")
 sqlContext.sql("SELECT station_id as station_id,station_name, SUM(cnt) as total FROM
popularStations GROUP BY station_id,station_name ORDER BY total DESC LIMIT 10")
 Q6. Which days of the week are most rides taken on?
 To determine the day of the week when most of the rides are taken.
 Grouping each day of week to aggregate the total rides and fetch the day with highest rides
taken on.
 sqlContext.sql("SELECT date_format(start_time,'E') AS Day, COUNT(1) AS cnt FROM bike
GROUP BY date_format(start_time,'E') ORDER BY cnt DESC LIMIT 1")
Sample output screenshot
Q7. Predicting the gender based on
patterns in the trip
 Building a model using Logistic Regression to predict the gender based on patterns in the trip.
 Converting the bikeDF Dataframe to RDD so as to apply the data to model.
 val bikeRDD = bikeDF.rdd
 To define the Label and Features, extracted the features based on ride patterns.
 val features = bikeRDD.map {
bike => val gender = if (bike.getInt(14) == 1) 0.0 else if (bike.getInt(14) == 2) 1.0 else 2.0
val trip_duration = bike.getInt(0).toDouble
val start_time = bike.getTimestamp(1).getTime.toDouble
val stop_time = bike.getTimestamp(2).getTime.toDouble
val start_station_id = bike.getInt(3).toDouble
val start_station_latitude = bike.getDouble(5)
val start_station_longitude = bike.getDouble(6)
val end_station_id = bike.getInt(7).toDouble
val end_station_latitude = bike.getDouble(9)
val end_station_longitude = bike.getDouble(10)
val user_type = if (bike.getString(12) == "Customer") 1.0 else 2.0
Array(gender, trip_duration, start_time, stop_time, start_station_id, start_station_latitude, start_station_longitude,
end_station_id, end_station_latitude, end_station_longitude, user_type) }
Q7. Predicting the gender based on
patterns in the trip
 Gender is used as label and other relevant columns are used as features to form
the Labeled Vector.
 val labeled = features.map { x => LabeledPoint(x(0), Vectors.dense(x(1), x(2), x(3), x(4),
x(5), x(6), x(7), x(8), x(9), x(10)))}
 Splitting the data for Tests and Training respectively.
 val training = labeled.filter(_.label != 2).randomSplit(Array(0.60, 0.40))(1) val test
= labeled.filter(_.label != 2).randomSplit(Array(0.40, 0.60))(1)
 Running the training algorithm to build the predicting model. Cached training set
is passed to the training algorithm.
 Number of classes is set to 2.
 val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(training)
Q7. Predicting the gender based on
patterns in the trip
 Computing the predictions on the Testing data.
 val predictionAndLabels = test.map {
case LabeledPoint(label, features) =>
val prediction = model.predict(features)
(prediction, label) }
 To determine the accuracy of the built model.
 Fetching the count of Wrong predictions found based on label and prediction
values.
 With the help of total test count and wrong counts accuracy is determined.
 val wrong = predictionAndLabels.filter {
case (label, prediction) => label != prediction }
 val wrong_count = wrong.count
 val accuracy = 1 - (wrong_count.toDouble / test_count)
Sample output screenshot
Conclusion
 Citi Bike NYC data loaded and analysed successfully.
 CSV data loaded as Dataframe to answer the business questions using Spark SQL.
 Built an Mllib Logistic Regression model to predict the gender of each rider based
on patterns in the trip.
 Code: https://github.com/ssushmanth/CitiBikeNYC-spark

More Related Content

What's hot

Public key Infrastructure (PKI)
Public key Infrastructure (PKI)Public key Infrastructure (PKI)
Public key Infrastructure (PKI)
Venkatesh Jambulingam
 
Berikut lirik lagu deen assalam
Berikut lirik lagu deen assalamBerikut lirik lagu deen assalam
Berikut lirik lagu deen assalam
mohd zaidi
 
Demystifying blockchain
Demystifying blockchain   Demystifying blockchain
Demystifying blockchain
Vinod Kashyap
 
Sennder - NOAH17 Berlin
Sennder - NOAH17 BerlinSennder - NOAH17 Berlin
Sennder - NOAH17 Berlin
NOAH Advisors
 
Applied Cryptography
Applied CryptographyApplied Cryptography
Applied Cryptography
Marcelo Martins
 
Data as a Service (DaaS): The What, Why, How, Who, and When
Data as a Service (DaaS): The What, Why, How, Who, and WhenData as a Service (DaaS): The What, Why, How, Who, and When
Data as a Service (DaaS): The What, Why, How, Who, and When
RocketSource
 
Detecting fraud with Python and machine learning
Detecting fraud with Python and machine learningDetecting fraud with Python and machine learning
Detecting fraud with Python and machine learning
wgyn
 
Topic1 substitution transposition-techniques
Topic1 substitution transposition-techniquesTopic1 substitution transposition-techniques
Topic1 substitution transposition-techniques
MdFazleRabbi18
 
Credit Card Fraud Detection
Credit Card Fraud DetectionCredit Card Fraud Detection
Credit Card Fraud Detection
ijtsrd
 
Real Time Gross Settlement
Real Time Gross SettlementReal Time Gross Settlement
Real Time Gross Settlement
Pratheeban Rajendran
 
Cloud Computing In Banking And Finance Industry
Cloud Computing In Banking And Finance IndustryCloud Computing In Banking And Finance Industry
Cloud Computing In Banking And Finance Industry
Tyrone Systems
 
Payment Gateway by iPay88
Payment Gateway by iPay88Payment Gateway by iPay88
Payment Gateway by iPay88
sitecmy
 
Blockchain Technology
Blockchain TechnologyBlockchain Technology
Blockchain Technology
Mithileysh Sathiyanarayanan
 
Peter Afanasiev - Architecture of online Payments
Peter Afanasiev - Architecture of online PaymentsPeter Afanasiev - Architecture of online Payments
Peter Afanasiev - Architecture of online Payments
Ciklum Ukraine
 
Enterprise Blockchain PowerPoint Presentation Slides
Enterprise Blockchain PowerPoint Presentation SlidesEnterprise Blockchain PowerPoint Presentation Slides
Enterprise Blockchain PowerPoint Presentation Slides
SlideTeam
 
PCI DSS: What it is, and why you should care
PCI DSS: What it is, and why you should carePCI DSS: What it is, and why you should care
PCI DSS: What it is, and why you should care
Sean D. Goodwin
 
Mobile Wallet functions
Mobile Wallet functionsMobile Wallet functions
Mobile Wallet functions
Mikhail Miroshnichenko
 
The digital transformation of CPG and manufacturing
The digital transformation of CPG and manufacturingThe digital transformation of CPG and manufacturing
The digital transformation of CPG and manufacturing
Cloudera, Inc.
 
Topic2 caser hill_cripto
Topic2 caser hill_criptoTopic2 caser hill_cripto
Topic2 caser hill_cripto
MdFazleRabbi18
 

What's hot (20)

Public key Infrastructure (PKI)
Public key Infrastructure (PKI)Public key Infrastructure (PKI)
Public key Infrastructure (PKI)
 
Berikut lirik lagu deen assalam
Berikut lirik lagu deen assalamBerikut lirik lagu deen assalam
Berikut lirik lagu deen assalam
 
IT in Banking v0 2
IT in Banking v0 2IT in Banking v0 2
IT in Banking v0 2
 
Demystifying blockchain
Demystifying blockchain   Demystifying blockchain
Demystifying blockchain
 
Sennder - NOAH17 Berlin
Sennder - NOAH17 BerlinSennder - NOAH17 Berlin
Sennder - NOAH17 Berlin
 
Applied Cryptography
Applied CryptographyApplied Cryptography
Applied Cryptography
 
Data as a Service (DaaS): The What, Why, How, Who, and When
Data as a Service (DaaS): The What, Why, How, Who, and WhenData as a Service (DaaS): The What, Why, How, Who, and When
Data as a Service (DaaS): The What, Why, How, Who, and When
 
Detecting fraud with Python and machine learning
Detecting fraud with Python and machine learningDetecting fraud with Python and machine learning
Detecting fraud with Python and machine learning
 
Topic1 substitution transposition-techniques
Topic1 substitution transposition-techniquesTopic1 substitution transposition-techniques
Topic1 substitution transposition-techniques
 
Credit Card Fraud Detection
Credit Card Fraud DetectionCredit Card Fraud Detection
Credit Card Fraud Detection
 
Real Time Gross Settlement
Real Time Gross SettlementReal Time Gross Settlement
Real Time Gross Settlement
 
Cloud Computing In Banking And Finance Industry
Cloud Computing In Banking And Finance IndustryCloud Computing In Banking And Finance Industry
Cloud Computing In Banking And Finance Industry
 
Payment Gateway by iPay88
Payment Gateway by iPay88Payment Gateway by iPay88
Payment Gateway by iPay88
 
Blockchain Technology
Blockchain TechnologyBlockchain Technology
Blockchain Technology
 
Peter Afanasiev - Architecture of online Payments
Peter Afanasiev - Architecture of online PaymentsPeter Afanasiev - Architecture of online Payments
Peter Afanasiev - Architecture of online Payments
 
Enterprise Blockchain PowerPoint Presentation Slides
Enterprise Blockchain PowerPoint Presentation SlidesEnterprise Blockchain PowerPoint Presentation Slides
Enterprise Blockchain PowerPoint Presentation Slides
 
PCI DSS: What it is, and why you should care
PCI DSS: What it is, and why you should carePCI DSS: What it is, and why you should care
PCI DSS: What it is, and why you should care
 
Mobile Wallet functions
Mobile Wallet functionsMobile Wallet functions
Mobile Wallet functions
 
The digital transformation of CPG and manufacturing
The digital transformation of CPG and manufacturingThe digital transformation of CPG and manufacturing
The digital transformation of CPG and manufacturing
 
Topic2 caser hill_cripto
Topic2 caser hill_criptoTopic2 caser hill_cripto
Topic2 caser hill_cripto
 

Similar to Spark - Citi Bike NYC

Spark Summit EU talk by Francois Garillot and Mohamed Kafsi
Spark Summit EU talk by Francois Garillot and Mohamed KafsiSpark Summit EU talk by Francois Garillot and Mohamed Kafsi
Spark Summit EU talk by Francois Garillot and Mohamed Kafsi
Spark Summit
 
Mobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in SwitzerlandMobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in Switzerland
François Garillot
 
Deep Dive Into Swift
Deep Dive Into SwiftDeep Dive Into Swift
Deep Dive Into Swift
Sarath C
 
IRJET- Explore the World
IRJET- 	  Explore the WorldIRJET- 	  Explore the World
IRJET- Explore the World
IRJET Journal
 
What are customers building with new Bing Maps capabilities
What are customers building with new Bing Maps capabilitiesWhat are customers building with new Bing Maps capabilities
What are customers building with new Bing Maps capabilities
Microsoft Tech Community
 
Supply chain logistics : vehicle routing and scheduling
Supply chain logistics : vehicle  routing and  schedulingSupply chain logistics : vehicle  routing and  scheduling
Supply chain logistics : vehicle routing and scheduling
Retigence Technologies
 
MongoDB Analytics: Learn Aggregation by Example - Exploratory Analytics and V...
MongoDB Analytics: Learn Aggregation by Example - Exploratory Analytics and V...MongoDB Analytics: Learn Aggregation by Example - Exploratory Analytics and V...
MongoDB Analytics: Learn Aggregation by Example - Exploratory Analytics and V...MongoDB
 
SFBay Area Bike Share Analysis
SFBay Area Bike Share AnalysisSFBay Area Bike Share Analysis
SFBay Area Bike Share Analysis
Swapnil Patil
 
Lunchlezing landelijke keuzemodellen voor Octavius
Lunchlezing landelijke keuzemodellen voor OctaviusLunchlezing landelijke keuzemodellen voor Octavius
Lunchlezing landelijke keuzemodellen voor Octavius
Luuk Brederode
 
Real-Time Analytics at Uber Scale
Real-Time Analytics at Uber ScaleReal-Time Analytics at Uber Scale
Real-Time Analytics at Uber Scale
SingleStore
 
IRJET- Survey on Adaptive Routing Algorithms
IRJET- Survey on Adaptive Routing AlgorithmsIRJET- Survey on Adaptive Routing Algorithms
IRJET- Survey on Adaptive Routing Algorithms
IRJET Journal
 
112 portfpres.pdf
112 portfpres.pdf112 portfpres.pdf
112 portfpres.pdf
sash236
 
Geo CO - RTD Denver
Geo CO - RTD DenverGeo CO - RTD Denver
Geo CO - RTD Denver
Mike Giddens
 
Presentation
PresentationPresentation
Presentation
Vickey Rawat
 
Better Visualization of Trips through Agglomerative Clustering
Better Visualization of  Trips through     Agglomerative ClusteringBetter Visualization of  Trips through     Agglomerative Clustering
Better Visualization of Trips through Agglomerative ClusteringAnbarasan S
 
Case Study: Building Analytic Models over Big Data
Case Study: Building Analytic Models over Big DataCase Study: Building Analytic Models over Big Data
Case Study: Building Analytic Models over Big Data
Collin Bennett
 
Demo: tablet-based visualisation of transport data in Madrid using SPARQLstream
Demo: tablet-based visualisation of transport data in Madrid using SPARQLstreamDemo: tablet-based visualisation of transport data in Madrid using SPARQLstream
Demo: tablet-based visualisation of transport data in Madrid using SPARQLstream
PlanetData Network of Excellence
 
SSN2013 Demo: tablet based visualization of transport data with SPARQLStream
SSN2013 Demo: tablet based visualization of transport data with SPARQLStreamSSN2013 Demo: tablet based visualization of transport data with SPARQLStream
SSN2013 Demo: tablet based visualization of transport data with SPARQLStreamJean-Paul Calbimonte
 
Sample document
Sample documentSample document
Sample document
arunsethu87
 
Geo distance search with my sql presentation
Geo distance search with my sql presentationGeo distance search with my sql presentation
Geo distance search with my sql presentationGSMboy
 

Similar to Spark - Citi Bike NYC (20)

Spark Summit EU talk by Francois Garillot and Mohamed Kafsi
Spark Summit EU talk by Francois Garillot and Mohamed KafsiSpark Summit EU talk by Francois Garillot and Mohamed Kafsi
Spark Summit EU talk by Francois Garillot and Mohamed Kafsi
 
Mobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in SwitzerlandMobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in Switzerland
 
Deep Dive Into Swift
Deep Dive Into SwiftDeep Dive Into Swift
Deep Dive Into Swift
 
IRJET- Explore the World
IRJET- 	  Explore the WorldIRJET- 	  Explore the World
IRJET- Explore the World
 
What are customers building with new Bing Maps capabilities
What are customers building with new Bing Maps capabilitiesWhat are customers building with new Bing Maps capabilities
What are customers building with new Bing Maps capabilities
 
Supply chain logistics : vehicle routing and scheduling
Supply chain logistics : vehicle  routing and  schedulingSupply chain logistics : vehicle  routing and  scheduling
Supply chain logistics : vehicle routing and scheduling
 
MongoDB Analytics: Learn Aggregation by Example - Exploratory Analytics and V...
MongoDB Analytics: Learn Aggregation by Example - Exploratory Analytics and V...MongoDB Analytics: Learn Aggregation by Example - Exploratory Analytics and V...
MongoDB Analytics: Learn Aggregation by Example - Exploratory Analytics and V...
 
SFBay Area Bike Share Analysis
SFBay Area Bike Share AnalysisSFBay Area Bike Share Analysis
SFBay Area Bike Share Analysis
 
Lunchlezing landelijke keuzemodellen voor Octavius
Lunchlezing landelijke keuzemodellen voor OctaviusLunchlezing landelijke keuzemodellen voor Octavius
Lunchlezing landelijke keuzemodellen voor Octavius
 
Real-Time Analytics at Uber Scale
Real-Time Analytics at Uber ScaleReal-Time Analytics at Uber Scale
Real-Time Analytics at Uber Scale
 
IRJET- Survey on Adaptive Routing Algorithms
IRJET- Survey on Adaptive Routing AlgorithmsIRJET- Survey on Adaptive Routing Algorithms
IRJET- Survey on Adaptive Routing Algorithms
 
112 portfpres.pdf
112 portfpres.pdf112 portfpres.pdf
112 portfpres.pdf
 
Geo CO - RTD Denver
Geo CO - RTD DenverGeo CO - RTD Denver
Geo CO - RTD Denver
 
Presentation
PresentationPresentation
Presentation
 
Better Visualization of Trips through Agglomerative Clustering
Better Visualization of  Trips through     Agglomerative ClusteringBetter Visualization of  Trips through     Agglomerative Clustering
Better Visualization of Trips through Agglomerative Clustering
 
Case Study: Building Analytic Models over Big Data
Case Study: Building Analytic Models over Big DataCase Study: Building Analytic Models over Big Data
Case Study: Building Analytic Models over Big Data
 
Demo: tablet-based visualisation of transport data in Madrid using SPARQLstream
Demo: tablet-based visualisation of transport data in Madrid using SPARQLstreamDemo: tablet-based visualisation of transport data in Madrid using SPARQLstream
Demo: tablet-based visualisation of transport data in Madrid using SPARQLstream
 
SSN2013 Demo: tablet based visualization of transport data with SPARQLStream
SSN2013 Demo: tablet based visualization of transport data with SPARQLStreamSSN2013 Demo: tablet based visualization of transport data with SPARQLStream
SSN2013 Demo: tablet based visualization of transport data with SPARQLStream
 
Sample document
Sample documentSample document
Sample document
 
Geo distance search with my sql presentation
Geo distance search with my sql presentationGeo distance search with my sql presentation
Geo distance search with my sql presentation
 

Recently uploaded

The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
Pierluigi Pugliese
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
Dorra BARTAGUIZ
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
Peter Spielvogel
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
nkrafacyberclub
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
Welocme to ViralQR, your best QR code generator.
Welocme to ViralQR, your best QR code generator.Welocme to ViralQR, your best QR code generator.
Welocme to ViralQR, your best QR code generator.
ViralQR
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
RinaMondal9
 

Recently uploaded (20)

The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
Welocme to ViralQR, your best QR code generator.
Welocme to ViralQR, your best QR code generator.Welocme to ViralQR, your best QR code generator.
Welocme to ViralQR, your best QR code generator.
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
 

Spark - Citi Bike NYC

  • 1. Citi Bike NYC By, Sushmanth Sagala Spark Project – UpX Academy
  • 2. Project Information  Domain: Travel  Technology use: Spark, Spark SQL, Spark MLlib  Dataset: https://s3.amazonaws.com/tripdata/index.html  Citi Bike is the nation's largest bike share program, with 10,000 bikes and 600 stations across Manhattan, Brooklyn, Queens and Jersey City. Analyze this data to find the interesting patterns in the traffic.
  • 3. Data Description  Trip Duration (seconds)  Start Time and Date  Stop Time and Date  Start Station Name  End Station Name  Station ID  Station Lat/Long  Bike ID  User Type (Customer = 24-hour pass or 3-day pass user; Subscriber = Annual Member)  Gender (Zero=unknown; 1=male; 2=female)  Year of Birth
  • 4. Business Questions  Spark Core / Spark SQL  Which route Citi Bikers ride the most?  Find the biggest trip and its duration?  When do they ride?  How far do they go?  Which stations are most popular?  Which days of the week are most rides taken on?  Spark MLLIB  Predicting the gender based on patterns in the trip
  • 5. Data Loading  Used com.databricks.spark library to load CSV data.  val bike = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").option("delimiter", ",").load("/user/cloudera/Project/CitiBikeNYC/* - Citi Bike trip data.csv")  Parsed to rename the columns and register the Dataframe as a Temporary Table.  val bikeDF = bike.toDF("trip_duration", "start_time", "stop_time", "start_station_id", "start_station_name", "start_station_latitude", "start_station_longitude", "end_station_id", "end_station_name", "end_station_latitude", "end_station_longitude", "bike_id", "user_type", "birth_year", "gender")  bikeDF.registerTempTable("bike")
  • 6. Business Questions Answered  Q1. Which route Citi Bikers ride the most?  To determine the route Citi Bikers ride the most.  Grouped the Start and End station combination to find the aggregate count and fetching the route with highest rides taken on.  sqlContext.sql("SELECT start_station_id, start_station_name, end_station_id, end_station_name, COUNT(1) AS cnt FROM bike GROUP BY start_station_id, start_station_name, end_station_id, end_station_name ORDER BY cnt DESC LIMIT 1")  Q2. Find the biggest trip and its duration?  To Determine the biggest trip, analyzed trip duration and distance of each trip.  Distance of each trip calculated based on distance between station location (Latitude and Longitude).  Defined an UDF function calcDist to calculate the distance between locations.
  • 7. Business Questions Answered  def calcDist(lat1: Double, lon1: Double,lat2: Double, lon2: Double): Int = { val AVERAGE_RADIUS_OF_EARTH_KM = 6371 val latDistance = Math.toRadians(lat1 - lat2) val lngDistance = Math.toRadians(lon1 - lon2) val sinLat = Math.sin(latDistance / 2) val sinLng = Math.sin(lngDistance / 2) val a = sinLat * sinLat + (Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2)) * sinLng * sinLng) val c = 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a)) (AVERAGE_RADIUS_OF_EARTH_KM * c).toInt }  sqlContext.udf.register("calcDist", calcDist _)  Ordered by the trip duration and distance to find the biggest trip.  sqlContext.sql("SELECT start_station_id, start_station_name, end_station_id, end_station_name, trip_duration, calcDist(start_station_latitude, start_station_longitude, end_station_latitude, end_station_longitude) AS dist FROM bike ORDER BY trip_duration DESC, dist DESC LIMIT 1")
  • 8. Business Questions Answered  Q3. When do they ride?  To determine the top 10 most preferred time for rides.  Grouped the Start time based on Hours and Minutes to aggregate the number of rides taken at respective time and fetching the top 10 most preferred Start time by riders.  sqlContext.sql("SELECT date_format(start_time,'H:m'), COUNT(1) as cnt FROM bike GROUP BY date_format(start_time,'H:m') ORDER BY cnt DESC LIMIT 10")  Q4. How far do they go?  To determine the farthest ride ever taken.  Calculated the distance between Start and End stations based on the UDF function calcDist created and fetched the ride with longest distance.  sqlContext.sql("SELECT start_station_id, start_station_name, end_station_id, end_station_name, trip_duration, calcDist(start_station_latitude, start_station_longitude, end_station_latitude, end_station_longitude) AS dist FROM bike ORDER BY dist DESC LIMIT 1")
  • 9. Business Questions Answered  Q5. Which stations are most popular?  To determine the top 10 most popular stations among the riders.  To get the count aggregations of each station, used union to get the overall station list irrespective of Start or End station and fetch the top 10 stations.  var popularStationsDF = sqlContext.sql("SELECT start_station_id as station_id, start_station_name as station_name, COUNT(1) as cnt FROM bike GROUP BY start_station_id, start_station_name UNION ALL SELECT end_station_id as station_id, end_station_name as station_name, COUNT(1) as cnt FROM bike GROUP BY end_station_id, end_station_name")  popularStationsDF.registerTempTable("popularStations")  sqlContext.sql("SELECT station_id as station_id,station_name, SUM(cnt) as total FROM popularStations GROUP BY station_id,station_name ORDER BY total DESC LIMIT 10")  Q6. Which days of the week are most rides taken on?  To determine the day of the week when most of the rides are taken.  Grouping each day of week to aggregate the total rides and fetch the day with highest rides taken on.  sqlContext.sql("SELECT date_format(start_time,'E') AS Day, COUNT(1) AS cnt FROM bike GROUP BY date_format(start_time,'E') ORDER BY cnt DESC LIMIT 1")
  • 11. Q7. Predicting the gender based on patterns in the trip  Building a model using Logistic Regression to predict the gender based on patterns in the trip.  Converting the bikeDF Dataframe to RDD so as to apply the data to model.  val bikeRDD = bikeDF.rdd  To define the Label and Features, extracted the features based on ride patterns.  val features = bikeRDD.map { bike => val gender = if (bike.getInt(14) == 1) 0.0 else if (bike.getInt(14) == 2) 1.0 else 2.0 val trip_duration = bike.getInt(0).toDouble val start_time = bike.getTimestamp(1).getTime.toDouble val stop_time = bike.getTimestamp(2).getTime.toDouble val start_station_id = bike.getInt(3).toDouble val start_station_latitude = bike.getDouble(5) val start_station_longitude = bike.getDouble(6) val end_station_id = bike.getInt(7).toDouble val end_station_latitude = bike.getDouble(9) val end_station_longitude = bike.getDouble(10) val user_type = if (bike.getString(12) == "Customer") 1.0 else 2.0 Array(gender, trip_duration, start_time, stop_time, start_station_id, start_station_latitude, start_station_longitude, end_station_id, end_station_latitude, end_station_longitude, user_type) }
  • 12. Q7. Predicting the gender based on patterns in the trip  Gender is used as label and other relevant columns are used as features to form the Labeled Vector.  val labeled = features.map { x => LabeledPoint(x(0), Vectors.dense(x(1), x(2), x(3), x(4), x(5), x(6), x(7), x(8), x(9), x(10)))}  Splitting the data for Tests and Training respectively.  val training = labeled.filter(_.label != 2).randomSplit(Array(0.60, 0.40))(1) val test = labeled.filter(_.label != 2).randomSplit(Array(0.40, 0.60))(1)  Running the training algorithm to build the predicting model. Cached training set is passed to the training algorithm.  Number of classes is set to 2.  val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(training)
  • 13. Q7. Predicting the gender based on patterns in the trip  Computing the predictions on the Testing data.  val predictionAndLabels = test.map { case LabeledPoint(label, features) => val prediction = model.predict(features) (prediction, label) }  To determine the accuracy of the built model.  Fetching the count of Wrong predictions found based on label and prediction values.  With the help of total test count and wrong counts accuracy is determined.  val wrong = predictionAndLabels.filter { case (label, prediction) => label != prediction }  val wrong_count = wrong.count  val accuracy = 1 - (wrong_count.toDouble / test_count)
  • 15. Conclusion  Citi Bike NYC data loaded and analysed successfully.  CSV data loaded as Dataframe to answer the business questions using Spark SQL.  Built an Mllib Logistic Regression model to predict the gender of each rider based on patterns in the trip.  Code: https://github.com/ssushmanth/CitiBikeNYC-spark