Learn big data with Uber

©2017 Cloudreach
Presented by Mark Thebault | 19/01/2018

Agenda
Big Data, real use case
How to get an Uber?
The Uber use case
IoT, Handle high throughput
Process data, real time, batch
Let’s build a Big Data Platform

Let’s talk about Uber

Big data, real use case
● 8 million users
● 160k drivers
● 400 cities around the world
● 1 million rides per day
● 2 billion rides recorded

Big data, real use case
Dozens of cities opened each year

Big data, real use case
Driver locations
Ride requests
Payments
User feedback
Metrics...
Driver position sent every 4 seconds
Rider requests
Millions of requests per second

The architecture
● Handle throughput variation
● Fraud detection
● Low latency
● Scalability

IoT - Imbibed of Tequila

IoT, how to handle data?
Uber use case:
● Different sources / destinations
● Different throughputs
● Different formats
● Several consumers for the same data

How to get good performance? Apache Kafka
● Distributed messaging platform
● Created in 2009 by LinkedIn
● Known for its
○ Performance
○ Scalability
● Target: centralise all data exchange

Clean the mess: Centralize data!

Why Kafka?
● Horizontally scalable by adding new servers
● Enables very high throughput, allowing real-time data flows
● Compared with traditional brokers (RabbitMQ, ActiveMQ)
○ Less expensive
○ Better performance
○ Easily scalable

How does it work?
● Publisher / subscriber model
● Messages are sent to topics
● Producers publish data
● Consumers read data from a given offset
● A node is called a broker
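
The publish / subscribe model above can be illustrated with a small in-memory toy. This is only a sketch of the idea, not Kafka's client API: the `Broker` class, its `publish` and `fetch` methods and the topic names are all hypothetical.

```python
from collections import defaultdict


class Broker:
    """Toy in-memory broker: one list of messages per topic."""

    def __init__(self) -> None:
        self.topics = defaultdict(list)

    def publish(self, topic: str, message: str) -> int:
        """Called by producers; returns the offset of the stored message."""
        self.topics[topic].append(message)
        return len(self.topics[topic]) - 1

    def fetch(self, topic: str, offset: int) -> list:
        """Called by consumers; returns every message from `offset` onwards."""
        return self.topics[topic][offset:]


broker = Broker()
broker.publish("driver-positions", "driver-1:48.85,2.35")
broker.publish("ride-requests", "rider-7 wants a ride")
last_offset = broker.publish("driver-positions", "driver-1:48.86,2.36")

# Two independent consumers read the same topic from different offsets.
from_start = broker.fetch("driver-positions", offset=0)
only_new = broker.fetch("driver-positions", offset=1)
```

Because reading never removes a message, several consumers can process the same topic independently.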

Stream data with Kafka
● Messages are stored on disk
● Writes go through the OS page cache (RAM) before being flushed to disk
● By default a message is retained for 7 days
● Topics are split into partitions, and partitions are replicated
● Messages are stored in an append-only log file
○ Each message has an offset
○ Each consumer tracks its own offset
(better read performance)
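
To make the log, retention and offset ideas concrete, here is a minimal sketch under simplifying assumptions. The `Log` class is hypothetical, and entries expire one by one here, which is a simplification of how a real broker cleans up its log.

```python
from dataclasses import dataclass, field

RETENTION_SECONDS = 7 * 24 * 3600  # the 7-day default mentioned on the slide


@dataclass
class Log:
    """Toy append-only log; offsets keep increasing even after old entries expire."""
    base_offset: int = 0                         # offset of the oldest retained entry
    entries: list = field(default_factory=list)  # (timestamp, message) pairs

    def append(self, timestamp: float, message: str) -> int:
        """Add a message at the end of the log and return its offset."""
        self.entries.append((timestamp, message))
        return self.base_offset + len(self.entries) - 1

    def expire(self, now: float) -> None:
        """Drop entries older than the retention period."""
        while self.entries and now - self.entries[0][0] > RETENTION_SECONDS:
            self.entries.pop(0)
            self.base_offset += 1

    def read(self, offset: int) -> list:
        """Return the messages at `offset` and later that are still retained."""
        start = max(offset - self.base_offset, 0)
        return [message for _, message in self.entries[start:]]


DAY = 24 * 3600
log = Log()
log.append(0 * DAY, "ride-1")
log.append(5 * DAY, "ride-2")
log.append(8 * DAY, "ride-3")
log.expire(now=8 * DAY)  # ride-1 is now 8 days old and is dropped

consumer_offset = 1      # the consumer, not the log, remembers this value
unread = log.read(consumer_offset)
```

The log keeps no per-consumer state, so adding a consumer costs the log nothing.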

Kafka topic anatomy
● Topics are split into partitions
● Partitions are replicated for fault tolerance
● Each partition has a leader server
and zero or more follower
servers
● Writes are handled by the leader
● Reads are served by the leader by default
(follower reads are possible since Kafka 2.4)
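
A rough sketch of how a message key could be routed to a partition, and how each partition gets a leader and followers. The broker names, the partition count and the CRC32 hash are assumptions made for illustration; Kafka's own partitioner and replica placement differ in the details.

```python
import zlib

NUM_PARTITIONS = 3                              # hypothetical topic with 3 partitions
BROKERS = ["broker-0", "broker-1", "broker-2"]  # hypothetical broker names
REPLICATION_FACTOR = 2


def partition_for(key: str) -> int:
    """Map a message key to a partition with a stable hash (CRC32 as a stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def replicas_for(partition: int) -> dict:
    """Round-robin replica placement: the first replica acts as the leader."""
    replicas = [
        BROKERS[(partition + i) % len(BROKERS)]
        for i in range(REPLICATION_FACTOR)
    ]
    return {"leader": replicas[0], "followers": replicas[1:]}


layout = {p: replicas_for(p) for p in range(NUM_PARTITIONS)}

# The same key always lands on the same partition, which preserves
# the ordering of one driver's messages.
same_partition = partition_for("driver-42") == partition_for("driver-42")
```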

How to get an Uber?

NoSQL Databases
Uber’s database use case: read user data, read drivers’ positions…
● Fast response time, high throughput
● Your application scales, so your database needs to scale too
● Always available: no downtime
● Store large amounts of data
● Lower cost than an RDBMS for the same purpose

Apache Cassandra
● Initially developed at Facebook; open-sourced in 2008
● Column-oriented; tables use schemas by default
● Built to be deployed at very large scale across multiple data centers
● Values are identified by a unique key
○ Row key (unique ID)
○ Column name
○ Column value
○ Write timestamp (set by Cassandra by default)
○ Expiration date (optional)

Cassandra behind the scenes
● Peer-to-Peer (P2P)
○ No master, no slaves
● Multi-datacenter
○ Geographical distribution
○ Separation of operational and analytic workloads
● Gossip protocol
○ Runs once per second
○ Exchanges cluster state information
○ With up to 3 other nodes
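
The gossip idea can be simulated in a few lines. This is a toy model, not Cassandra's implementation: each node starts out knowing only itself and, on every round, merges what it knows with a few randomly chosen peers. The function names, the fan-out of 3 and the fixed seed are assumptions for the sketch.

```python
import random


def gossip_round(knowledge: dict, fanout: int, rng: random.Random) -> None:
    """One round: every node merges its state with up to `fanout` random peers."""
    nodes = list(knowledge)
    for node in nodes:
        peers = [n for n in nodes if n != node]
        for peer in rng.sample(peers, min(fanout, len(peers))):
            merged = knowledge[node] | knowledge[peer]
            knowledge[node] = merged
            knowledge[peer] = set(merged)


def rounds_until_converged(num_nodes: int, fanout: int = 3, seed: int = 0) -> int:
    """Count the rounds needed until every node knows about every other node."""
    rng = random.Random(seed)
    # Initially each node only knows about itself.
    knowledge = {n: {n} for n in range(num_nodes)}
    rounds = 0
    while any(len(known) < num_nodes for known in knowledge.values()):
        gossip_round(knowledge, fanout, rng)
        rounds += 1
    return rounds


rounds = rounds_until_converged(num_nodes=50)
```

Cluster state spreads in a handful of rounds even though no node ever contacts the whole cluster, which is why the approach works without a master.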

Cassandra Data Model
● Columns with a defined schema
● Partition key
○ Rows with the same partition key are stored on the same node
○ Choose the partition key wisely:
performance depends on it!
● Clustering key
○ Orders rows within a partition, so a slice can be read
○ Used in WHERE clauses
● Static columns
○ Values shared across all rows of a partition
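
A small in-memory sketch of what the partition key and clustering key do. The ride rows and the `rides_between` helper are made up for illustration; `rider_id` plays the role of the partition key and `ride_time` the clustering key.

```python
from collections import defaultdict

# Hypothetical ride rows: (rider_id, ride_time, city, fare).
rows = [
    ("rider-1", "2018-01-03T09:00", "Paris", 12.5),
    ("rider-2", "2018-01-02T18:30", "London", 20.0),
    ("rider-1", "2018-01-01T08:15", "Paris", 9.0),
    ("rider-1", "2018-01-02T22:40", "Paris", 15.0),
]

# Partition key: all rows of one rider end up in the same partition.
partitions = defaultdict(list)
for rider_id, ride_time, city, fare in rows:
    partitions[rider_id].append((ride_time, city, fare))

# Clustering key: rows inside a partition are kept sorted by ride_time.
for partition_rows in partitions.values():
    partition_rows.sort(key=lambda row: row[0])


def rides_between(rider_id: str, start: str, end: str) -> list:
    """Roughly: WHERE rider_id = ? AND ride_time >= ? AND ride_time < ?"""
    return [row for row in partitions.get(rider_id, []) if start <= row[0] < end]


recent_rides = rides_between("rider-1", "2018-01-02", "2018-01-04")
```

A query that names the partition key touches a single partition, which is why the choice of key drives performance.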

Cassandra Query Language (CQL)

Cassandra Consistency Management
● Customisable consistency
○ Writes: number of replica acknowledgments required
○ Reads: number of replicas that must respond
● Levels
○ ONE
○ QUORUM
○ ALL…
● The consistency level is set per query
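
The arithmetic behind these levels can be sketched as follows. The helper names are hypothetical; the rule illustrated is the usual one that a read is guaranteed to see the latest write when the read and write replica counts add up to more than the replication factor.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that form a majority."""
    return replication_factor // 2 + 1


def replicas_required(level: str, replication_factor: int) -> int:
    """Replicas that must respond for the ONE, QUORUM and ALL levels."""
    required = {
        "ONE": 1,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,
    }
    return required[level]


def is_strongly_consistent(write_level: str, read_level: str,
                           replication_factor: int) -> bool:
    """True when every read replica set must overlap every write replica set."""
    w = replicas_required(write_level, replication_factor)
    r = replicas_required(read_level, replication_factor)
    return r + w > replication_factor


quorum_both = is_strongly_consistent("QUORUM", "QUORUM", 3)
one_both = is_strongly_consistent("ONE", "ONE", 3)
```

With a replication factor of 3, QUORUM for both reads and writes needs 2 replicas each. Since 2 + 2 > 3, the two sets always share at least one replica.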

Process data, real time, batch

What is the data telling us?
Real-time processing
● Apply a price multiplier when there is an event (football match…)
● Find the best riders to share one UberPool
Batch processing
● Get daily analytics
● Detect fraudulent drivers
Problem: gigabytes of data per second, terabytes of data processed each day

Distributed processing
● Involves a large number of computers
○ Computers on the same local network
○ High bandwidth is required
● Parallel processing
○ Split the processing into separate tasks
○ Each computer does its own calculation
○ Results are merged
● Not every computation can be parallelized; keep this in mind when you design it
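
The split / compute / merge steps can be sketched on a single machine, with threads standing in for the separate computers. The fares and helper names are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def split(items: list, num_chunks: int) -> list:
    """Split the input into roughly equal chunks, one per worker."""
    size = max(1, -(-len(items) // num_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def total_fare(chunk: list) -> float:
    """The calculation each worker runs on its own chunk."""
    return sum(chunk)


fares = [12.5, 9.0, 15.0, 20.0, 7.5, 30.0, 11.0]  # made-up fares

chunks = split(fares, num_chunks=3)
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_sums = list(pool.map(total_fare, chunks))

grand_total = sum(partial_sums)  # merge step
```

A sum parallelizes well because partial results can be merged in any order. A computation where each step needs the previous step's result does not split this way.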

What is Spark?
● Open source framework (Apache)
● Processing of large volumes of data
● Faster than Hadoop MapReduce for many workloads, largely thanks to in-memory processing
● Distributed processing framework
● Main focuses
○ Streaming
○ Machine Learning
○ Extract Transform Load (ETL)

Spark Architecture
Cluster Management
● Spark Standalone
● Mesos
● YARN
● Kubernetes (beta)

What is MapReduce?
● Pattern published by Google in 2004
● A dataset is split into partitions:
○ Map: applies a transformation to each partition
○ Reduce: aggregates the results across partitions
● Items are distributed across the network: distributed processing
● Hadoop MapReduce is an implementation of this pattern

What is MapReduce? - Word count
Input text, split into partitions across Node1 and Node2:
To be, or not to be, that is the Question:
Whether 'tis Nobler in the minde to suffer
The Slings and Arrowes of outragious Fortune,
Or to take Armes against a Sea of troubles,
And by opposing end them: to dye, to sleepe
No more; and by a sleepe, to say we end
The Heart-ake, and the thousand Naturall shockes
That Flesh is heyre too? 'Tis a consummation
Deuoutly to be wish'd. To dye to sleepe,
To sleepe, perchance to Dreame; I, there's the rub,
For in that sleepe of death, what dreames may come,
When we haue shuffel'd off this mortall coile,
Must giue vs pawse. There's the respect
That makes Calamity of so long life
Map (one pair per word, per partition):
[to, 1] [be, 1] [or, 1] [not, 1] [to, 1] …
[and, 1] [by, 1] [opposing, 1] [end, 1] …
[that, 1] [flesh, 1] [is, 1] [heyre, 1] …
[for, 1] [in, 1] [that, 1] [sleepe, 1] [of, 1] …
Reduce (first pass):
[to, 4] [be, 2] [or, 2] [not, 1] …
[to, 3] …
[to, 5] [be, 1] …
…
Reduce (second pass):
[to, 7] [be, 2] [or, 2] [not, 1] …
[to, 5] [be, 1] …
Reduce (final):
[to, 12] [be, 3] [or, 2] [not, 1] …
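
The word-count flow above can be sketched in plain Python. This runs on one machine and only mirrors the map and reduce steps; it is not Hadoop or Spark code, and the function names are chosen for the example.

```python
from collections import Counter
from functools import reduce


def map_words(partition: str) -> list:
    """Map: emit a (word, 1) pair for every word in the partition."""
    words = (w.strip(".,:;?!'").lower() for w in partition.split())
    return [(w, 1) for w in words if w]


def reduce_pairs(pairs: list) -> Counter:
    """Reduce: sum the counts of identical words within one partition."""
    counts = Counter()
    for word, count in pairs:
        counts[word] += count
    return counts


def merge(left: Counter, right: Counter) -> Counter:
    """Reduce again: merge the counts computed for two partitions."""
    return left + right


partitions = [
    "To be, or not to be, that is the Question:",
    "And by opposing end them: to dye, to sleepe",
]

per_partition = [reduce_pairs(map_words(p)) for p in partitions]
total = reduce(merge, per_partition, Counter())
```

Each partition can be mapped and reduced on a different node, and only the small per-partition counts need to travel over the network for the final merge.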

Let’s build Cablito

Cablito - Personal Data
User personal information
Ride history
Driver positions (per area)

Cablito - Event Processing
Booking requests
Ride info
Logs
IN: Booking requests
OUT: Booking acceptances
Logs
Ride info
Request
Machine Learning models for fraud detection

Cablito - Analytics
Real-time metrics
Historical data
Aggregated data
BigBoss
Crazy Data Scientist
Building ML models
Aggregating data...
