Big Data At United Airlines
Joe Olson
Senior Manager, Big Data Analytics
DataWorks Summit San Jose - June 2018
Agenda
Data Landscape at United
Current Big Data Analytics Environment
Target Big Data Analytics Environment
A Few Big Data Analytics Use Cases
About United Airlines
▪ ~750 aircraft, with 250+ on order (supply chain)
▪ 148M passengers in 2017 (public facing web site, mobile app, time / geospatial based inventory, loyalty program, surveys, ancillary sales)
▪ 4500 daily departures (scheduling, operations, weather, route planning)
▪ 338 airports served, in 49 countries (baggage claim, check-ins)
▪ 86,000 employees (scheduling, pay)
▪ Constantly in motion! Future (and past) always changing.
▪ A data scientist / data engineer dream.
Source: https://hub.united.com/corporate-fact-sheet/
Goals Of The Enterprise Analytics Platform
▪ Improve Customer Experience
- How can we reduce friction when booking a reservation? When maneuvering through an airport?
- How can we deliver a consistent message across all channels? (mobile app, web site, social media, etc.)
▪ Improve Employee Experience
- How can we keep employees better informed of the current situation so they can relay it to the customers?
- What are we learning from our surveys about what the customer base says is / isn’t working?
▪ Revenue Generation
- What personalized offers can we make to our customers?
- Are our offers competitive with the rest of the industry?
▪ Improve Operational Reliability
- How can we better prepare for weather or other operational interruptions?
- How can we manage the fleet better and ensure spare parts are where they need to be?
Industry Ideas – Customer Experience
Current Analytics Environment
▪ Two Main Data Warehouse Platforms
- Teradata – mature data platform, in place for 20+ years. Dedicated team of 25+ people. ACID compliance allowing for updates. Most ETL here is tightly coupled with the platform.
- Hortonworks Platform – emerging technology. Economical data science. Data lake friendly. Community and support frameworks changing faster than the more mature Teradata. Log parsing. Unstructured data and streaming message friendly. Schema-on-read.
- How to get these to play together nicely?
▪ Enterprise Analytics Team Skills
- Very comfortable with SQL – jobs and dashboarding.
- Not so comfortable with parallel processing and APIs.
- Dependency on Hive.
Current Analytics Environment
Systems of Record:
- Bookings
- Operations
- Customer / Loyalty
- Supply Chain
- Logs
(merch, seat browsing, etc)
ETL
Systems of Truth:
ETL
Challenge #1 – Data Analytics / Science Where The Data Ain’t
▪ Bookings & flight schedule constantly in motion – all captured in real time in Teradata
- New state = current state + change
▪ 24 hr lagging snapshot refreshes for data science?
- Teradata is not optimized for “give me what changed yesterday” – especially in <k,v> situations.
- Extra bookkeeping on the Teradata side to enable offload for data science?
▪ Straight from the source into the data lake?
- ACID tables on the Hortonworks side? Write optimization compromises reads.
- Updates may not be able to keep up with the stream – Hive concurrency model.
- Stream to raw, then batch process after it lands on disk? Introduces latency.
▪ Pass-through queries?
- Still uses Teradata resources – spool space.
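The "new state = current state + change" idea on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions: the change record layout (op, key, fields) and the booking fields are hypothetical and are not taken from the Teradata or Hive schemas.

```python
from typing import Dict, Iterable

Booking = Dict[str, str]


def apply_changes(state: Dict[str, Booking], changes: Iterable[dict]) -> Dict[str, Booking]:
    """Return a new state: the current state with each change applied in order."""
    # Copy so the previous snapshot stays available for comparison.
    new_state = {key: dict(row) for key, row in state.items()}
    for change in changes:
        key = change["key"]
        if change["op"] == "delete":
            new_state.pop(key, None)
        else:  # treat anything else as an upsert (hypothetical convention)
            new_state.setdefault(key, {}).update(change["fields"])
    return new_state


# Hypothetical booking keys and fields, for illustration only.
current = {"PNR1": {"flight": "UA2032", "seat": "12A"}}
changes = [
    {"op": "upsert", "key": "PNR1", "fields": {"seat": "14C"}},
    {"op": "upsert", "key": "PNR2", "fields": {"flight": "UA2032", "seat": "3B"}},
    {"op": "delete", "key": "PNR1"},
]
new_state = apply_changes(current, changes)
```

The same shape applies whether the changes arrive as a stream or as a daily batch; only the latency differs.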
Challenge #1A – Structuring Data Big Data Side
▪ Bookings & flight schedule – mature relational model with (heavy) secondary indexing
- Needs to be queried from multiple directions
- LLAP cache of bookings and flight schedule? Enough space in RAM?
- De-normalized data model
• Not practical in a lot of cases.
- Partitioning, bucketing, ACID.
• Hive concurrency model: read blocks write and write blocks read. Complicates job scheduling.
So What’s Working?
▪ Data sync Teradata -> Hive – QueryGrid (Teradata)
- Pass-through queries vs data replication
- For replication, 4 – 5 patterns practical:
• ‘Small’ data sets
• ‘Large’ data sets where new data is append only and immutable
(Think appending yesterday as a new partition)
• ‘Large’ data sets where new data changes a ‘small’ number of existing partitions
(Think yesterday’s changes can affect data going back a full year)
- Works even better if the full year is partitioned by month, rather than by day. (create new)
• ‘Large’ data sets accessed in a <k,v> manner. (ACID)
- May need to re-partition a bucketed data set to allow time series queries
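The third replication pattern (changes touching a small number of existing partitions) can be illustrated with a short Python sketch. It only shows why coarser monthly partitions reduce the number of partitions to rebuild; the partition naming and the sample dates are assumptions, and nothing here calls QueryGrid or Hive.

```python
from datetime import date
from typing import Iterable, List


def month_partition(d: date) -> str:
    """Hypothetical partition key: one partition per calendar month."""
    return f"{d.year:04d}-{d.month:02d}"


def affected_partitions(changed_dates: Iterable[date]) -> List[str]:
    """Distinct monthly partitions that yesterday's changes touch."""
    return sorted({month_partition(d) for d in changed_dates})


# Flight dates of rows changed yesterday (illustrative values).
changed = [date(2018, 6, 11), date(2018, 6, 12), date(2017, 12, 24), date(2018, 6, 30)]

to_rebuild = affected_partitions(changed)              # monthly partitions to recreate
daily_partitions = sorted({d.isoformat() for d in changed})  # what daily partitioning would touch
```

With these sample dates, four daily partitions collapse into two monthly ones, each of which would be recreated as a whole.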
Analytics Environment
Systems of Record:
- Bookings
- Operations
- Customer / Loyalty
- Supply Chain
- Logs
(merch, seat browsing, etc)
ETL
Systems of Truth:
ETL
QG Option #1: replicate data
Queries served using only HDP resources
Analytics Environment
Systems of Record:
- Bookings
- Operations
- Customer / Loyalty
- Supply Chain
- Logs
(merch, seat browsing, etc)
ETL
Systems of Truth:
ETL
QG Option #2: database link
Queries served using Teradata resources
So What’s Working?
▪ Longer Term - Platform Independent ETL - NiFi
- NiFi – stateless streaming, and stateful streaming where latency can be tolerated.
• Append only to disk + consolidation job
- Common ingestion layer
- Need connectors from operational systems. Not always easy due to ‘operations’.
Diagram notes:
Option to buffer here, or run compaction job external to NiFi
Cosmetic enrichment.
or
Can also be replaced with a custom (k,v) parser
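The "append only to disk + consolidation job" pattern can be sketched as a small compaction step in Python. This is a simplified illustration, not the NiFi flow from the talk: the record layout (key, sequence number, payload) is an assumption, and a real job would read and rewrite files rather than in-memory lists.

```python
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, int, dict]  # (key, sequence number, payload), hypothetical layout


def compact(log: Iterable[Record]) -> List[Record]:
    """Consolidate an append-only log: keep the latest record for each key."""
    latest: Dict[str, Record] = {}
    for rec in log:
        key, seq, _payload = rec
        # A later sequence number wins; ties keep the record seen last.
        if key not in latest or seq >= latest[key][1]:
            latest[key] = rec
    return sorted(latest.values(), key=lambda r: r[0])


# Raw landing zone: every message is appended, nothing is updated in place.
raw_log = [
    ("PNR1", 1, {"seat": "12A"}),
    ("PNR2", 2, {"seat": "3B"}),
    ("PNR1", 3, {"seat": "14C"}),
]
compacted = compact(raw_log)
```

Writers never block on readers in this layout, at the cost of the latency introduced by waiting for the consolidation run.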
So What’s NOT Working (yet)?
▪ Data sync Teradata -> Hive – QueryGrid (Teradata)
- ‘Large’ data sets where new data changes a ‘large’ number of existing partitions.
- Leveraging QG’s pass-through query abilities here.
▪ Platform Independent ETL
- Streaming stateful messages
• Customized C++ code / Teradata
• Hortonworks Data Flow, Apache Apex, Apache Flink, NiFi + HBase, Spark micro batching.
- Enterprise message bus - issues
• Not designed with analytics in mind
• No schema registry
Target Architecture – Other Considerations
▪ Security
- Common security strategy with Teradata - GDPR
• Groups defined in Active Directory based on access needs, with users assigned to them.
• Groups and users replicated to Teradata and Apache Ranger
• Database roles / permissions defined and reviewed on each platform
▪ Governance
- Looking for a (reasonably priced) solution covering both platforms.
- Apache Atlas – Traceability through Hive, NiFi, HDFS, and Spark (soon) is encouraging.
- May have to resort to custom development using APIs
Target Architecture Data Lake / Curated Layer
Diagram labels:
- Batch sources: FTP, SCP
- Enterprise Message Bus (JMS sources: Apache Kafka, IBM MQ Series, Tibco EMS)
- Stateless / Stateful High Latency Tolerant Common Ingestion Layer
- Stateful, Low Latency Ingestion Layer
- State Store
- Apache NiFi
- Spark ETL
- Data Lake: Hortonworks (ORC on HDFS)
- Curated Layer: Teradata, Hortonworks
- Advanced Analytics / ML / Data Science
- Analytics / KPI Dashboards
- SQL, Spark, SAS, R, etc
Analytics Environment
Systems of Record:
- Logs
- Operations
- Customer / Loyalty
- Supply Chain
- Bookings
Systems of Truth:
Diagram labels:
- Batch sources: FTP, SCP
- Enterprise Message Bus
- Stateless / Stateful High Latency Tolerant Ingestion Layer
- Stateful, Low Latency Ingestion Layer
- Platform Independent ETL
- ???
- Raw Data Lake
- Curated Layer
- Flight Narrative
- Trip Narrative Active
- Trip Narrative History
Use Case: Flight Narrative
LAX – ORD UA 2032 06/11/18 11:00pm
All events that can be tied to a unique flight are stored as time series JSON objects:
<T, E, [<k,v>,<k,v>…]>
Event timeline:
02/01/18 – 1:00pm    Added to schedule
05/01/18 – 2:30pm    Aircraft assigned (737-800) #0523
06/02/18 – 10:15am   Equipment change 737-800 #0215
06/02/18 – 10:20am   Seat reaccommodation (click to see impact)
06/09/18 – 11:20am   Crew schedule finalized
06/10/18 – 9:00pm    Gate assignment B22
06/11/18 – 5:00pm    Departure change 11:22pm (Late Inbound Crew)
06/11/18 – 8:00pm    MRD released
06/11/18 – 11:00pm   Boarding begins
06/11/18 – 11:25pm   Catering
06/11/18 – 11:27pm   Boarding ends
06/11/18 – 11:28pm   Last bag scanned
06/11/18 – 11:32pm   Out/Off/Taxi
06/12/18 – 5:30am    On/In/Taxi
06/12/18 – 6:05am    Bags delivered to claim
Event details:
- Inflight Stats: Altitude, Temperature, Wind, Fuel
- Catering: Catering Arrival Time, Catering Inventory, Catering Sign off time
- Crew List: Pilot, Flight Attendants
- Bag Data: Gate Checked Bags (Predicted/Actual), Bulkhead Timeout, # of Checked Bags, First/Last Bag Scanned on board, First/Last Bag Scanned to baggage claim
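The <T, E, [<k,v>…]> layout for flight events could look roughly like the following Python sketch. The JSON field names ("T", "E", "kv"), the key names inside the key/value pairs, and the timestamps attached to each event are assumptions made for illustration; the slide only states the general shape.

```python
import json
from datetime import datetime
from typing import Dict, List, NamedTuple


class FlightEvent(NamedTuple):
    t: datetime          # T: when the event happened
    e: str               # E: event name
    kv: Dict[str, str]   # [<k,v>, ...]: event-specific attributes


def to_json(events: List[FlightEvent]) -> str:
    """Serialize events for one flight as a time-ordered JSON array."""
    ordered = sorted(events, key=lambda ev: ev.t)
    return json.dumps(
        [{"T": ev.t.isoformat(), "E": ev.e, "kv": ev.kv} for ev in ordered]
    )


# Event names come from the slide; keys and timestamps are illustrative.
events = [
    FlightEvent(datetime(2018, 6, 10, 21, 0), "Gate assignment", {"gate": "B22"}),
    FlightEvent(datetime(2018, 6, 2, 10, 15), "Equipment change",
                {"aircraft": "737-800", "tail": "0215"}),
]
narrative = json.loads(to_json(events))
```

Keeping the attributes as free-form key/value pairs lets each event type carry its own fields without a schema change.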
Use Case: Trip Narrative
• Trip Narrative is a chronological collection of events that define a customer’s experience:
Phases: Pre-Travel, Day-of-Travel, Post-Travel
Ticket Issued
Schedule Change
Itinerary Change
Ancillary Purchase
Return to Blocks
Denied Boarding
Bag Delivered to Claim
Rebooked on OA
Cleared Standby
In/Out/On/Off
Upgrade Cleared
Flight Status Notification Sent
Mis-connect
Satisfaction Survey Submitted
Bag File Opened
Flight Delayed / Cancelled
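A trip narrative, as a chronological collection of customer events split into pre-travel, day-of-travel, and post-travel phases, could be assembled along these lines. This is a hedged sketch: it assumes a single travel day per trip and a simple (timestamp, event name) tuple, neither of which is specified in the talk.

```python
from datetime import date, datetime
from typing import Dict, List, Tuple

Event = Tuple[datetime, str]  # (timestamp, event name), hypothetical layout


def phase(event_time: datetime, travel_day: date) -> str:
    """Classify an event relative to the (assumed single) day of travel."""
    day = event_time.date()
    if day < travel_day:
        return "Pre-Travel"
    if day == travel_day:
        return "Day-of-Travel"
    return "Post-Travel"


def trip_narrative(events: List[Event], travel_day: date) -> Dict[str, List[str]]:
    """Order events chronologically and group them by travel phase."""
    buckets: Dict[str, List[str]] = {
        "Pre-Travel": [], "Day-of-Travel": [], "Post-Travel": [],
    }
    for t, name in sorted(events):
        buckets[phase(t, travel_day)].append(name)
    return buckets


# Event names come from the slide; timestamps are illustrative.
events = [
    (datetime(2018, 6, 13, 9, 0), "Satisfaction Survey Submitted"),
    (datetime(2018, 6, 11, 22, 30), "Upgrade Cleared"),
    (datetime(2018, 5, 1, 14, 30), "Ticket Issued"),
    (datetime(2018, 6, 11, 17, 0), "Flight Status Notification Sent"),
]
story = trip_narrative(events, date(2018, 6, 11))
```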
Q & A
We’re hiring!
- Data Engineers
- Data Scientists

 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfjimielynbastida
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsHyundai Motor Group
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Neo4j
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 

Recently uploaded (20)

Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdf
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 

Big data at United Airlines

  • 1. Big Data At United Airlines Joe Olson Senior Manager, Big Data Analytics DataWorks Summit San Jose - June 2018
  • 2. Agenda
      - Data Landscape at United
      - Current Big Data Analytics Environment
      - Target Big Data Analytics Environment
      - A Few Big Data Analytics Use Cases
  • 3. About United Airlines
      - ~750 aircraft, with 250+ on order (supply chain)
      - 148M passengers in 2017 (public-facing web site, mobile app, time / geospatial-based inventory, loyalty program, surveys, ancillary sales)
      - 4500 daily departures (scheduling, operations, weather, route planning)
      - 338 airports served, in 49 countries (baggage claim, check-ins)
      - 86,000 employees (scheduling, pay)
      - Constantly in motion! Future (and past) always changing.
      - A data scientist / data engineer dream.
      Source: https://hub.united.com/corporate-fact-sheet/
  • 4. Goals Of The Enterprise Analytics Platform
      - Improve Customer Experience
          - How can we reduce friction when booking a reservation? When maneuvering through an airport?
          - How can we deliver a consistent message across all channels (mobile app, web site, social media, etc.)?
      - Improve Employee Experience
          - How can we keep employees better informed of the current situation so they can relay it to the customers?
          - What are we learning from our surveys about what the customer base says is / isn’t working?
      - Revenue Generation
          - What personalized offers can we make to our customers?
          - Are our offers competitive with the rest of the industry?
      - Improve Operational Reliability
          - How can we better prepare for weather or other operational interruptions?
          - How can we manage the fleet better and ensure spare parts are where they need to be?
  • 5. Industry Ideas – Customer Experience
  • 6. Current Analytics Environment
      - Two Main Data Warehouse Platforms
          - Teradata – mature data platform, in place for 20+ years. Dedicated team of 25+ people. ACID compliance allowing for updates. Most ETL here is tightly coupled with the platform.
          - Hortonworks Platform – emerging technology. Economical data science. Data lake friendly. Community and support frameworks changing faster than the more mature Teradata. Log parsing. Unstructured data and streaming message friendly. Schema-on-read.
          - How to get these to play together nicely?
      - Enterprise Analytics Team Skills
          - Very comfortable with SQL – jobs and dashboarding.
          - Not so comfortable with parallel processing and APIs.
          - Dependency on Hive.
  • 7. Current Analytics Environment (diagram)
      - Systems of Record: Bookings; Operations; Customer / Loyalty; Supply Chain; Logs (merch, seat browsing, etc.)
      - Diagram labels: ETL; Systems of Truth; ETL
  • 8. Challenge #1 – Data Analytics / Science Where The Data Ain’t
      - Bookings & flight schedule constantly in motion – all captured in real time in Teradata
          - New state = current state + change
      - 24 hr lagging snapshot refreshes for data science?
          - Teradata not optimized for “give me what changed yesterday” – especially in <k,v> situations.
          - Extra bookkeeping on the TD side to enable offload for data science?
      - Straight to the source into data lake?
          - ACID tables on the Hortonworks side? Write optimization compromises read.
          - Updates may not be able to keep up with the stream – Hive concurrency model
          - Stream to raw, batch process after it lands on disk? Introduces latency.
      - Pass-through queries?
          - Still uses Teradata resources – spool space.
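The "New state = current state + change" rule on this slide can be illustrated with a minimal Python sketch. The record layout (a booking key mapped to a dictionary of fields, with `None` marking a delete) and all sample values are assumptions made for illustration; the deck does not describe the actual booking schema.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical change record: (booking key, changed fields), or None for a delete.
Change = Tuple[str, Optional[Dict[str, str]]]
State = Dict[str, Dict[str, str]]


def apply_changes(current: State, changes: Iterable[Change]) -> State:
    """Return new state = current state + changes, applied in arrival order."""
    new_state = {key: dict(row) for key, row in current.items()}  # leave input untouched
    for key, fields in changes:
        if fields is None:
            new_state.pop(key, None)  # delete; ignore keys that are already gone
        else:
            new_state.setdefault(key, {}).update(fields)  # insert or update
    return new_state


# Illustrative data only.
current = {"PNR1": {"flight": "UA2032", "seat": "12A"}}
changes = [
    ("PNR1", {"seat": "14C"}),
    ("PNR2", {"flight": "UA100", "seat": "3B"}),
]
new_state = apply_changes(current, changes)
```

A snapshot refresh copies the whole state every 24 hours, whereas this style only needs the changes, which is the "give me what changed yesterday" question the slide raises.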
  • 9. Challenge #1A – Structuring Data Big Data Side
      - Bookings & flight schedule – mature relational model with (heavy) secondary indexing
          - Needs to be queried from multiple directions
          - LLAP cache of bookings and flight schedule? Enough space in RAM?
          - De-normalized data model
              - Not practical in a lot of cases.
          - Partitioning, bucketing, ACID.
              - Hive concurrency model: read blocks write and write blocks read. Complicates job scheduling.
  • 10. So What’s Working?
      - Data sync Teradata -> Hive – QueryGrid (Teradata)
          - Pass-through queries vs data replication
          - For replication, 4 – 5 patterns practical:
              - ‘Small’ data sets
              - ‘Large’ data sets where new data is append only and immutable (think appending yesterday as a new partition)
              - ‘Large’ data sets where new data changes a ‘small’ number of existing partitions (think yesterday’s changes can affect data going back a full year)
                  - Works even better if the full year is partitioned by month, rather than by day. (create new)
              - ‘Large’ data sets accessed in a <k,v> manner. (ACID)
                  - May need to re-partition a bucketed data set to allow time series queries
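As a rough illustration of why monthly partitions help in the "changes a small number of existing partitions" pattern, the sketch below maps the dates touched by one day's changes to the partitions that would need to be rebuilt. The partition naming (`YYYY-MM` and `YYYY-MM-DD`) and the sample dates are assumptions, not taken from the deck.

```python
from datetime import date
from typing import Iterable, Set


def month_partition(d: date) -> str:
    """Hypothetical monthly partition key, e.g. '2018-06'."""
    return f"{d.year:04d}-{d.month:02d}"


def day_partition(d: date) -> str:
    """Hypothetical daily partition key, e.g. '2018-06-11'."""
    return d.isoformat()


def partitions_to_rebuild(changed_dates: Iterable[date], by_month: bool) -> Set[str]:
    """Distinct partitions touched by a batch of changed rows."""
    key = month_partition if by_month else day_partition
    return {key(d) for d in changed_dates}


# Illustrative data only: dates affected by yesterday's changes.
changed = [date(2018, 6, 11), date(2018, 6, 12), date(2018, 6, 20), date(2017, 7, 1)]
monthly = partitions_to_rebuild(changed, by_month=True)
daily = partitions_to_rebuild(changed, by_month=False)
```

With these sample dates, the monthly layout touches two partitions and the daily layout four, so a "create new partition and swap" job has fewer partitions to rewrite, at the cost of rewriting more rows per partition.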
  • 11. Analytics Environment (diagram)
      - Systems of Record: Bookings; Operations; Customer / Loyalty; Supply Chain; Logs (merch, seat browsing, etc.)
      - Diagram labels: ETL; Systems of Truth; ETL; QG
      - Option #1: replicate data. Queries served using only HDP resources.
  • 12. Analytics Environment (diagram)
      - Systems of Record: Bookings; Operations; Customer / Loyalty; Supply Chain; Logs (merch, seat browsing, etc.)
      - Diagram labels: ETL; Systems of Truth; ETL; QG
      - Option #2: database link. Queries served using Teradata resources.
  • 13. So What’s Working?
      - Longer Term - Platform Independent ETL - NiFi
          - NiFi – stateless streaming, and stateful streaming where latency can be tolerated.
              - Append only to disk + consolidation job
          - Common ingestion layer
          - Need connectors from operational systems. Not always easy due to ‘operations’
      - Diagram callouts:
          - Option to buffer here, or run compaction job external to NiFi
          - Cosmetic enrichment, or can also be replaced with a custom (k,v) parser
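The "append only to disk + consolidation job" idea can be sketched as a small compaction step that keeps only the latest record per key. This is a plain-Python illustration under assumed record fields (key, event time, value); it is not NiFi code and does not reflect how the job was actually implemented at United.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical appended record: (key, event_time, value).
Record = Tuple[str, int, str]


def compact(records: Iterable[Record]) -> List[Record]:
    """Consolidate an append-only log down to the latest record per key."""
    latest: Dict[str, Record] = {}
    for rec in records:
        key, event_time, _ = rec
        # Keep the record with the highest event time; later arrivals win ties.
        if key not in latest or event_time >= latest[key][1]:
            latest[key] = rec
    return sorted(latest.values())


# Illustrative data only: three appended updates for two flights.
appended = [
    ("UA2032", 1, "gate B20"),
    ("UA100", 2, "gate C4"),
    ("UA2032", 3, "gate B22"),
]
compacted = compact(appended)
```

Appending keeps the ingest path write-friendly, while the consolidation job runs later as a batch, which is where the tolerated latency comes from.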
  • 14. So What’s NOT Working (yet)?
      - Data sync Teradata -> Hive – QueryGrid (Teradata)
          - ‘Large’ data sets where new data changes a ‘large’ number of existing partitions.
          - Leveraging QG’s pass-through query abilities here.
      - Platform Independent ETL
          - Streaming stateful messages
              - Customized C++ code / Teradata
              - Hortonworks Data Flow, Apache Apex, Apache Flink, NiFi + HBase, Spark micro-batching.
          - Enterprise message bus - issues
              - Not designed with analytics in mind
              - No schema registry
  • 15. Target Architecture – Other Considerations
      - Security
          - Common security strategy with Teradata
          - GDPR
              - Groups defined in Active Directory based on access needs, users assigned to them.
              - Groups and users replicated to Teradata and Apache Ranger
              - Database roles / permissions defined and reviewed on each platform
      - Governance
          - Looking for a (reasonably priced) solution covering both platforms.
          - Apache Atlas – Traceability through Hive, NiFi, HDFS, and Spark (soon) is encouraging.
          - May have to resort to custom development using APIs
  • 16. Target Architecture: Data Lake / Curated Layer (diagram)
      - Diagram labels: State Store; Batch sources (FTP, SCP); Enterprise Message Bus (JMS sources: Apache Kafka, IBM MQ Series, Tibco EMS); Data Lake: Hortonworks (ORC on HDFS); Stateless / Stateful High Latency Tolerant Common Ingestion Layer; Stateful, Low Latency Ingestion Layer; Curated Layer: Teradata, Hortonworks; Spark ETL; Apache NiFi; Advanced Analytics / ML / Data Science; Analytics / KPI Dashboards; SQL; Spark, SAS, R, etc.
  • 17. Analytics Environment (diagram)
      - Systems of Record: Logs; Operations; Customer / Loyalty; Supply Chain; Bookings
      - Diagram labels: Systems of Truth; Batch sources (FTP, SCP); Enterprise Message Bus; Stateless / Stateful High Latency Tolerant Ingestion Layer; Stateful, Low Latency Ingestion Layer; Platform Independent ETL ???; Raw Data Lake; Curated Layer
      - Curated Layer contents: Flight Narrative; Trip Narrative Active; Trip Narrative History
  • 18. Use Case: Flight Narrative
      - Flight: LAX – ORD, UA 2032, 06/11/18 11:00pm
      - All events that can be tied to a unique flight are stored as time series JSON objects: <T, E, [<k,v>,<k,v>…]>
      - Events (in slide order): Added to schedule; Aircraft assigned (737-800) #0523; Equipment change 737-800 #0215; Seat reaccommodation (click to see impact); Crew schedule finalized; Gate assignment B22; Departure change 11:22pm (Late Inbound Crew); MRD released; Boarding begins; Catering; Boarding ends; Last bag scanned; Out/Off/Taxi; On/In/Taxi; Bags delivered to claim
      - Timestamps (in slide order): 02/01/18 – 1:00pm; 05/01/18 – 2:30pm; 06/02/18 – 10:15am; 06/02/18 – 10:20am; 06/09/18 – 11:20am; 06/10/18 – 9:00pm; 06/11/18 – 5:00pm; 06/11/18 – 8:00pm; 06/11/18 – 11:00pm; 06/11/18 – 11:25pm; 06/11/18 – 11:27pm; 06/11/18 – 11:28pm; 06/11/18 – 11:32pm; 06/12/18 – 5:30am; 06/12/18 – 6:05am
      - Inflight Stats: Altitude; Temperature; Wind; Fuel
      - Catering: Catering Arrival Time; Catering Inventory; Catering Sign off time
      - Crew List: Pilot; Flight Attendants
      - Bag Data: Gate Checked Bags (Predicted/Actual); Bulkhead Timeout; # of Checked Bags; First/Last Bag Scanned on board; First/Last Bag Scanned to baggage claim
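One way to picture the `<T, E, [<k,v>,<k,v>…]>` shape is the short Python sketch below, which builds a few events and serializes them as JSON. The field names (`T`, `E`, `kv`, `k`, `v`), the ISO timestamp format, and the pairing of each timestamp with an event are assumptions for illustration; the slide gives only the abstract tuple.

```python
import json
from typing import Dict, List


def make_event(timestamp: str, event: str, attributes: Dict[str, str]) -> dict:
    """Build one <T, E, [<k,v>, ...]> record as a JSON-serializable dict."""
    return {
        "T": timestamp,  # event time, ISO 8601 so string order equals time order
        "E": event,      # event name
        "kv": [{"k": k, "v": v} for k, v in attributes.items()],
    }


# Illustrative events for one flight; timestamp pairing is assumed, not from the slide.
narrative: List[dict] = [
    make_event("2018-06-10T21:00", "Gate assignment", {"gate": "B22"}),
    make_event("2018-02-01T13:00", "Added to schedule", {}),
    make_event("2018-05-01T14:30", "Aircraft assigned", {"type": "737-800", "tail": "0523"}),
]

# A narrative is the time-ordered series of events for one flight.
narrative.sort(key=lambda e: e["T"])
encoded = json.dumps(narrative)
```

Keeping the attributes as an open list of key/value pairs lets new event types be added without changing the record shape, which fits the schema-on-read approach mentioned earlier in the deck.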
  • 19. Use Case: Trip Narrative
      - Trip Narrative is a chronological collection of events that define a customer’s experience.
      - Phases: Pre-Travel; Day-of-Travel; Post-Travel
      - Events: Ticket Issued; Schedule Change; Itinerary Change; Ancillary Purchase; Return to Blocks; Denied Boarding; Bag Delivered to Claim; Rebooked on OA; Cleared Standby; In/Out/On/Off; Upgrade Cleared; Flight Status Notification Sent; Mis-connect; Satisfaction Survey Submitted; Bag File Opened; Flight Delayed / Cancelled
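A trip narrative as "a chronological collection of events" grouped into the three phases on the slide might be assembled roughly as follows. The single travel day, the event timestamps, and the `travel_phase` helper are hypothetical; a real trip can span several days and segments, which this sketch ignores.

```python
from datetime import date, datetime
from typing import List, Tuple


def travel_phase(event_time: datetime, travel_day: date) -> str:
    """Classify an event relative to a single (assumed) travel day."""
    if event_time.date() < travel_day:
        return "Pre-Travel"
    if event_time.date() == travel_day:
        return "Day-of-Travel"
    return "Post-Travel"


# Illustrative data only.
travel_day = date(2018, 6, 11)
events: List[Tuple[str, datetime]] = [
    ("Satisfaction Survey Submitted", datetime(2018, 6, 13, 9, 0)),
    ("Ticket Issued", datetime(2018, 5, 1, 14, 30)),
    ("Upgrade Cleared", datetime(2018, 6, 11, 20, 0)),
]

# Sort chronologically, then tag each event with its phase.
trip_narrative = [
    (name, when, travel_phase(when, travel_day))
    for name, when in sorted(events, key=lambda e: e[1])
]
```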
  • 20. Q & A
      - We’re hiring!
          - Data Engineers
          - Data Scientists