SlideShare a Scribd company logo
1 of 13
Download to read offline
Paper Introduction
-
Big Data Architecture with Deep Learning
Dive Into Code - Nguyen Tien Thanh
I. Introduction
II. Papers reviews
1. Lambda Architecture - Real time Data Processing
2. Delta Lake: High-Performance ACID Table Storage over
Cloud Object Stores
3. Lakehouse: A New Generation of Open Platforms that
Unify Data Warehousing and Advanced Analytics
4. Conformer: Convolution-augmented Transformer for
Speech Recognition
5. Intent Detection and Slot Filling for Vietnamese
III. Conclusion
I. Introduction
➽ My graduation thesis:
Dynamic routing in urban traffic based on
traffic condition.
⊕ Sub-work:
- Prepare and process data: users’ speech
report about traffic congestion
- Store in data storage (in form of audio
files)
=> Analysis to predict traffic condition.
Try: https://bktraffic.com/home/
1. Lambda Architecture - Real time Data Processing
- Big data is increasingly popular -> Data can be very intensive in terms
of Volume, Variety and Velocity.
- Traditional batch data processing framework like Hadoop, MapReduce is not
suitable for real-time applications.
- Solution: stream data processing technique -> provide fast output for
real-time applications.
Problem: Trade-off between latency and accuracy.
=> Lambda Architecture is a big data processing architecture both
techniques: balancing the workload whether to serve Real-time or
Reliable results.
2. Delta Lake: High-Performance ACID Table Storage
over Cloud Object Stores
- Data storage plays a key role in big data architecture.
- Conventional database warehouses like SQL relational database, or even NoSQL
-> has limitation on unbounded data (data streaming in real-time, etc).
- Solution: data lake (large, cost-effective, efficient for structured and
unstructured data, etc) -> becomes the top choice in the industry
Problem: normal data lake is inefficient to manage or operate.
=> Delta Lake: add a ACID layer to data lake -> make it efficient to
query, control just like traditional data warehouses.
3. Lakehouse: A New Generation of Open Platforms that
Unify Data Warehousing and Advanced Analytics
- Adopts Delta Lake to propose the supreme of Lakehouse architecture over Data
warehouse architecture.
- With advance data storage system, unify batch & stream processing engines ->
reduce the burden of development and maintenance that Lambda architecture
presents
Delta Lake + Apache Spark + minimal specialized services
=> Full fledged big data architecture: scalable, high performance,
applicable for machine learning and deep learning techniques, etc.
Unify batch layer & speed layer
from Lambda Architecture
-> Delta Architecture
4. Conformer: Convolution-augmented Transformer for Speech
Recognition
2 prevalent AI techniques:
● Transformer (ie. BERT): capturing content-based from global interactions.
● Convolutional neural network (CNN): exploit local features
=> Novel model combines both techniques: Convolution-augmented transformer,
namely Conformer.
- Previous Transformer or CNN based models have shown promising results in
Automatic Speech Recognition (ASR).
- Conformer (based on both) -> outperforms them and becomes SOTA in ASR.
=> It can capture both local and global ‘parameters’ of an audio sequence.
Conformer encoder model architecture
5. Intent Detection and Slot Filling for Vietnamese
- Adopts from “BERT for Joint Intent Classification and Slot Filling” by a
Vietnamese AI institution (VinAI).
- Original model: based on BERT to understand human sentences (natural
language understanding).
- Its main purpose is to: detect if a sentence is of subject that we are
interested in (intent detection) and extract important information (slot
filling)
A case in the Snips dataset
III. Conclusion
The 5 papers are for my research works:
- Developing and deploy system which is efficient for big data analysis (in my
case, user speech report analysis)
- Applied solutions: Speech-to-text recognition + Text information extraction.
Paper:
[1] Yuvraj Kumar (2020). Lambda Architecture - Real time Data Processing. DOI:
10.13140/RG.2.2.19091.84004
[2] Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores. DOI:
10.14778/3415478.3415560 (2020)
[3] Armbrust, Michael and Ghodsi, Ali and Xin, Reynold and Zaharia, Matei (2021). Lakehouse: A
New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics.
[4] Gulati, Anmol and Qin, James and Chiu, Chung-Cheng and Parmar, Niki and Zhang, Yu and Yu,
Jiahui and Han, Wei and Wang, Shibo and Zhang, Zhengdong and Wu, Yonghui and others
(2020). Conformer: Convolution-augmented Transformer for Speech Recognition. URL:
https://arxiv.org/abs/2005.08100
[5] Dao, Mai Hoang and Truong, Thinh Hung and Nguyen, Dat Quoc (2021). Intent Detection and
Slot Filling for Vietnamese. URL: https://arxiv.org/abs/2104.02021

More Related Content

What's hot

Big Data Platform at Pinterest
Big Data Platform at PinterestBig Data Platform at Pinterest
Big Data Platform at PinterestQubole
 
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, VectorizedData Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, VectorizedHostedbyConfluent
 
Lessons Learned - Monitoring the Data Pipeline at Hulu
Lessons Learned - Monitoring the Data Pipeline at HuluLessons Learned - Monitoring the Data Pipeline at Hulu
Lessons Learned - Monitoring the Data Pipeline at HuluDataWorks Summit
 
Flink Forward Berlin 2017: Aris Kyriakos Koliopoulos - Drivetribe's Kappa Arc...
Flink Forward Berlin 2017: Aris Kyriakos Koliopoulos - Drivetribe's Kappa Arc...Flink Forward Berlin 2017: Aris Kyriakos Koliopoulos - Drivetribe's Kappa Arc...
Flink Forward Berlin 2017: Aris Kyriakos Koliopoulos - Drivetribe's Kappa Arc...Flink Forward
 
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn confluent
 
Cost Effectively and Reliably Aggregating Billions of Messages Per Day Using ...
Cost Effectively and Reliably Aggregating Billions of Messages Per Day Using ...Cost Effectively and Reliably Aggregating Billions of Messages Per Day Using ...
Cost Effectively and Reliably Aggregating Billions of Messages Per Day Using ...confluent
 
Low-latency data applications with Kafka and Agg indexes | Tino Tereshko, Fir...
Low-latency data applications with Kafka and Agg indexes | Tino Tereshko, Fir...Low-latency data applications with Kafka and Agg indexes | Tino Tereshko, Fir...
Low-latency data applications with Kafka and Agg indexes | Tino Tereshko, Fir...HostedbyConfluent
 
The Key to Machine Learning is Prepping the Right Data with Jean Georges Perrin
The Key to Machine Learning is Prepping the Right Data with Jean Georges Perrin The Key to Machine Learning is Prepping the Right Data with Jean Georges Perrin
The Key to Machine Learning is Prepping the Right Data with Jean Georges Perrin Databricks
 
AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017
AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017
AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017Monal Daxini
 
Flink at netflix paypal speaker series
Flink at netflix   paypal speaker seriesFlink at netflix   paypal speaker series
Flink at netflix paypal speaker seriesMonal Daxini
 
How Credit Karma Makes Real-Time Decisions For 60 Million Users With Akka Str...
How Credit Karma Makes Real-Time Decisions For 60 Million Users With Akka Str...How Credit Karma Makes Real-Time Decisions For 60 Million Users With Akka Str...
How Credit Karma Makes Real-Time Decisions For 60 Million Users With Akka Str...Lightbend
 
Guaranteed Event Delivery with Kafka and NodeJS | Amitesh Madhur, Nutanix
Guaranteed Event Delivery with Kafka and NodeJS | Amitesh Madhur, NutanixGuaranteed Event Delivery with Kafka and NodeJS | Amitesh Madhur, Nutanix
Guaranteed Event Delivery with Kafka and NodeJS | Amitesh Madhur, NutanixHostedbyConfluent
 
Case Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets at Cisco IntercloudCase Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets at Cisco IntercloudRick Bilodeau
 
Hoodie: How (And Why) We built an analytical datastore on Spark
Hoodie: How (And Why) We built an analytical datastore on SparkHoodie: How (And Why) We built an analytical datastore on Spark
Hoodie: How (And Why) We built an analytical datastore on SparkVinoth Chandar
 
Lambda architecture: from zero to One
Lambda architecture: from zero to OneLambda architecture: from zero to One
Lambda architecture: from zero to OneSerg Masyutin
 
Rental Cars and Industrialized Learning to Rank with Sean Downes
Rental Cars and Industrialized Learning to Rank with Sean DownesRental Cars and Industrialized Learning to Rank with Sean Downes
Rental Cars and Industrialized Learning to Rank with Sean DownesDatabricks
 
Hands On With Spark: Creating A Fast Data Pipeline With Structured Streaming ...
Hands On With Spark: Creating A Fast Data Pipeline With Structured Streaming ...Hands On With Spark: Creating A Fast Data Pipeline With Structured Streaming ...
Hands On With Spark: Creating A Fast Data Pipeline With Structured Streaming ...Lightbend
 
Hadoop made fast - Why Virtual Reality Needed Stream Processing to Survive
Hadoop made fast - Why Virtual Reality Needed Stream Processing to SurviveHadoop made fast - Why Virtual Reality Needed Stream Processing to Survive
Hadoop made fast - Why Virtual Reality Needed Stream Processing to Surviveconfluent
 
Stream All Things—Patterns of Modern Data Integration with Gwen Shapira
Stream All Things—Patterns of Modern Data Integration with Gwen ShapiraStream All Things—Patterns of Modern Data Integration with Gwen Shapira
Stream All Things—Patterns of Modern Data Integration with Gwen ShapiraDatabricks
 

What's hot (20)

Big Data Platform at Pinterest
Big Data Platform at PinterestBig Data Platform at Pinterest
Big Data Platform at Pinterest
 
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, VectorizedData Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
 
Lessons Learned - Monitoring the Data Pipeline at Hulu
Lessons Learned - Monitoring the Data Pipeline at HuluLessons Learned - Monitoring the Data Pipeline at Hulu
Lessons Learned - Monitoring the Data Pipeline at Hulu
 
Flink Forward Berlin 2017: Aris Kyriakos Koliopoulos - Drivetribe's Kappa Arc...
Flink Forward Berlin 2017: Aris Kyriakos Koliopoulos - Drivetribe's Kappa Arc...Flink Forward Berlin 2017: Aris Kyriakos Koliopoulos - Drivetribe's Kappa Arc...
Flink Forward Berlin 2017: Aris Kyriakos Koliopoulos - Drivetribe's Kappa Arc...
 
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn
 
Cost Effectively and Reliably Aggregating Billions of Messages Per Day Using ...
Cost Effectively and Reliably Aggregating Billions of Messages Per Day Using ...Cost Effectively and Reliably Aggregating Billions of Messages Per Day Using ...
Cost Effectively and Reliably Aggregating Billions of Messages Per Day Using ...
 
Low-latency data applications with Kafka and Agg indexes | Tino Tereshko, Fir...
Low-latency data applications with Kafka and Agg indexes | Tino Tereshko, Fir...Low-latency data applications with Kafka and Agg indexes | Tino Tereshko, Fir...
Low-latency data applications with Kafka and Agg indexes | Tino Tereshko, Fir...
 
The Key to Machine Learning is Prepping the Right Data with Jean Georges Perrin
The Key to Machine Learning is Prepping the Right Data with Jean Georges Perrin The Key to Machine Learning is Prepping the Right Data with Jean Georges Perrin
The Key to Machine Learning is Prepping the Right Data with Jean Georges Perrin
 
AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017
AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017
AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017
 
Flink at netflix paypal speaker series
Flink at netflix   paypal speaker seriesFlink at netflix   paypal speaker series
Flink at netflix paypal speaker series
 
How Credit Karma Makes Real-Time Decisions For 60 Million Users With Akka Str...
How Credit Karma Makes Real-Time Decisions For 60 Million Users With Akka Str...How Credit Karma Makes Real-Time Decisions For 60 Million Users With Akka Str...
How Credit Karma Makes Real-Time Decisions For 60 Million Users With Akka Str...
 
Guaranteed Event Delivery with Kafka and NodeJS | Amitesh Madhur, Nutanix
Guaranteed Event Delivery with Kafka and NodeJS | Amitesh Madhur, NutanixGuaranteed Event Delivery with Kafka and NodeJS | Amitesh Madhur, Nutanix
Guaranteed Event Delivery with Kafka and NodeJS | Amitesh Madhur, Nutanix
 
Self-Service Analytics on Hadoop: Lessons Learned
Self-Service Analytics on Hadoop: Lessons LearnedSelf-Service Analytics on Hadoop: Lessons Learned
Self-Service Analytics on Hadoop: Lessons Learned
 
Case Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets at Cisco IntercloudCase Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
 
Hoodie: How (And Why) We built an analytical datastore on Spark
Hoodie: How (And Why) We built an analytical datastore on SparkHoodie: How (And Why) We built an analytical datastore on Spark
Hoodie: How (And Why) We built an analytical datastore on Spark
 
Lambda architecture: from zero to One
Lambda architecture: from zero to OneLambda architecture: from zero to One
Lambda architecture: from zero to One
 
Rental Cars and Industrialized Learning to Rank with Sean Downes
Rental Cars and Industrialized Learning to Rank with Sean DownesRental Cars and Industrialized Learning to Rank with Sean Downes
Rental Cars and Industrialized Learning to Rank with Sean Downes
 
Hands On With Spark: Creating A Fast Data Pipeline With Structured Streaming ...
Hands On With Spark: Creating A Fast Data Pipeline With Structured Streaming ...Hands On With Spark: Creating A Fast Data Pipeline With Structured Streaming ...
Hands On With Spark: Creating A Fast Data Pipeline With Structured Streaming ...
 
Hadoop made fast - Why Virtual Reality Needed Stream Processing to Survive
Hadoop made fast - Why Virtual Reality Needed Stream Processing to SurviveHadoop made fast - Why Virtual Reality Needed Stream Processing to Survive
Hadoop made fast - Why Virtual Reality Needed Stream Processing to Survive
 
Stream All Things—Patterns of Modern Data Integration with Gwen Shapira
Stream All Things—Patterns of Modern Data Integration with Gwen ShapiraStream All Things—Patterns of Modern Data Integration with Gwen Shapira
Stream All Things—Patterns of Modern Data Integration with Gwen Shapira
 

Similar to Ml sprint16 thesis_intro

Apache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big DataApache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big DataPaco Nathan
 
Big Data to SMART Data : Process Scenario
Big Data to SMART Data : Process ScenarioBig Data to SMART Data : Process Scenario
Big Data to SMART Data : Process ScenarioCHAKER ALLAOUI
 
From Single Purpose to Multi Purpose Data Lakes - Broadening End Users
From Single Purpose to Multi Purpose Data Lakes - Broadening End UsersFrom Single Purpose to Multi Purpose Data Lakes - Broadening End Users
From Single Purpose to Multi Purpose Data Lakes - Broadening End UsersDenodo
 
TechEvent Building a Data Lake
TechEvent Building a Data LakeTechEvent Building a Data Lake
TechEvent Building a Data LakeTrivadis
 
Hadoop and Beyond
Hadoop and BeyondHadoop and Beyond
Hadoop and BeyondPaco Nathan
 
Introduction to Cloud computing and Big Data-Hadoop
Introduction to Cloud computing and  Big Data-HadoopIntroduction to Cloud computing and  Big Data-Hadoop
Introduction to Cloud computing and Big Data-HadoopNagarjuna D.N
 
Big Data_Architecture.pptx
Big Data_Architecture.pptxBig Data_Architecture.pptx
Big Data_Architecture.pptxbetalab
 
Inroduction to Big Data
Inroduction to Big DataInroduction to Big Data
Inroduction to Big DataOmnia Safaan
 
Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)
Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)
Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)Denodo
 
Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22marpierc
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)Prashant Gupta
 
Data Apps with the Lambda Architecture - with Real Work Examples on Merging B...
Data Apps with the Lambda Architecture - with Real Work Examples on Merging B...Data Apps with the Lambda Architecture - with Real Work Examples on Merging B...
Data Apps with the Lambda Architecture - with Real Work Examples on Merging B...Altan Khendup
 
GTC-DC 2017 Session: Advanced Analytics and Machine Learning with Geospatial ...
GTC-DC 2017 Session: Advanced Analytics and Machine Learning with Geospatial ...GTC-DC 2017 Session: Advanced Analytics and Machine Learning with Geospatial ...
GTC-DC 2017 Session: Advanced Analytics and Machine Learning with Geospatial ...Kinetica
 
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...DataWorks Summit
 
Hadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridHadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridEvert Lammerts
 
Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Jim Dowling
 
Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)
Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)
Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)Denodo
 

Similar to Ml sprint16 thesis_intro (20)

Apache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big DataApache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big Data
 
04 open source_tools
04 open source_tools04 open source_tools
04 open source_tools
 
Big Data to SMART Data : Process Scenario
Big Data to SMART Data : Process ScenarioBig Data to SMART Data : Process Scenario
Big Data to SMART Data : Process Scenario
 
From Single Purpose to Multi Purpose Data Lakes - Broadening End Users
From Single Purpose to Multi Purpose Data Lakes - Broadening End UsersFrom Single Purpose to Multi Purpose Data Lakes - Broadening End Users
From Single Purpose to Multi Purpose Data Lakes - Broadening End Users
 
Real time analytics
Real time analyticsReal time analytics
Real time analytics
 
TechEvent Building a Data Lake
TechEvent Building a Data LakeTechEvent Building a Data Lake
TechEvent Building a Data Lake
 
Big data business case
Big data   business caseBig data   business case
Big data business case
 
Hadoop and Beyond
Hadoop and BeyondHadoop and Beyond
Hadoop and Beyond
 
Introduction to Cloud computing and Big Data-Hadoop
Introduction to Cloud computing and  Big Data-HadoopIntroduction to Cloud computing and  Big Data-Hadoop
Introduction to Cloud computing and Big Data-Hadoop
 
Big Data_Architecture.pptx
Big Data_Architecture.pptxBig Data_Architecture.pptx
Big Data_Architecture.pptx
 
Inroduction to Big Data
Inroduction to Big DataInroduction to Big Data
Inroduction to Big Data
 
Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)
Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)
Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)
 
Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)
 
Data Apps with the Lambda Architecture - with Real Work Examples on Merging B...
Data Apps with the Lambda Architecture - with Real Work Examples on Merging B...Data Apps with the Lambda Architecture - with Real Work Examples on Merging B...
Data Apps with the Lambda Architecture - with Real Work Examples on Merging B...
 
GTC-DC 2017 Session: Advanced Analytics and Machine Learning with Geospatial ...
GTC-DC 2017 Session: Advanced Analytics and Machine Learning with Geospatial ...GTC-DC 2017 Session: Advanced Analytics and Machine Learning with Geospatial ...
GTC-DC 2017 Session: Advanced Analytics and Machine Learning with Geospatial ...
 
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
 
Hadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridHadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG Grid
 
Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019
 
Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)
Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)
Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)
 

Recently uploaded

Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
Machine learning classification ppt.ppt
Machine learning classification  ppt.pptMachine learning classification  ppt.ppt
Machine learning classification ppt.pptamreenkhanum0307
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 

Recently uploaded (20)

Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
Machine learning classification ppt.ppt
Machine learning classification  ppt.pptMachine learning classification  ppt.ppt
Machine learning classification ppt.ppt
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 

Ml sprint16 thesis_intro

  • 1. Paper Introduction - Big Data Architecture with Deep Learning Dive Into Code - Nguyen Tien Thanh
  • 2. I. Introduction II. Papers reviews 1. Lambda Architecture - Real time Data Processing 2. Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores 3. Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics 4. Conformer: Convolution-augmented Transformer for Speech Recognition 5. Intent Detection and Slot Filling for Vietnamese III. Conclusion
  • 3. I. Introduction ➽ My graduation thesis: Dynamic routing in urban traffic based on traffic condition. ⊕ Sub-work: - Prepare and process data: users’ speech report about traffic congestion - Store in data storage (in form of audio files) => Analysis to predict traffic condition. Try: https://bktraffic.com/home/
  • 4. 1. Lambda Architecture - Real time Data Processing - Big data is increasingly popular -> Data can be very intensive in terms of Volume, Variety and Velocity. - Traditional batch data processing framework like Hadoop, MapReduce is not suitable for real-time applications. - Solution: stream data processing technique -> provide fast output for real-time applications. Problem: Trade-off between latency and accuracy. => Lambda Architecture is a big data processing architecture both techniques: balancing the workload whether to serve Real-time or Reliable results.
  • 5.
  • 6. 2. Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores - Data storage plays a key role in big data architecture. - Conventional database warehouses like SQL relational database, or even NoSQL -> has limitation on unbounded data (data streaming in real-time, etc). - Solution: data lake (large, cost-effective, efficient for structured and unstructured data, etc) -> becomes the top choice in the industry Problem: normal data lake is inefficient to manage or operate. => Delta Lake: add a ACID layer to data lake -> make it efficient to query, control just like traditional data warehouses.
  • 7. 3. Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics - Adopts Delta Lake to propose the supreme of Lakehouse architecture over Data warehouse architecture. - With advance data storage system, unify batch & stream processing engines -> reduce the burden of development and maintenance that Lambda architecture presents Delta Lake + Apache Spark + minimal specialized services => Full fledged big data architecture: scalable, high performance, applicable for machine learning and deep learning techniques, etc.
  • 8. Unify batch layer & speed layer from Lambda Architecture -> Delta Architecture
  • 9. 4. Conformer: Convolution-augmented Transformer for Speech Recognition 2 prevalent AI techniques: ● Transformer (ie. BERT): capturing content-based from global interactions. ● Convolutional neural network (CNN): exploit local features => Novel model combines both techniques: Convolution-augmented transformer, namely Conformer. - Previous Transformer or CNN based models have shown promising results in Automatic Speech Recognition (ASR). - Conformer (based on both) -> outperforms them and becomes SOTA in ASR. => It can capture both local and global ‘parameters’ of an audio sequence.
  • 10. Conformer encoder model architecture
  • 11. 5. Intent Detection and Slot Filling for Vietnamese - Adopts from “BERT for Joint Intent Classification and Slot Filling” by a Vietnamese AI institution (VinAI). - Original model: based on BERT to understand human sentences (natural language understanding). - Its main purpose is to: detect if a sentence is of subject that we are interested in (intent detection) and extract important information (slot filling)
  • 12. A case in the Snips dataset
  • 13. III. Conclusion The 5 papers are for my research works: - Developing and deploy system which is efficient for big data analysis (in my case, user speech report analysis) - Applied solutions: Speech-to-text recognition + Text information extraction. Paper: [1] Yuvraj Kumar (2020). Lambda Architecture - Real time Data Processing. DOI: 10.13140/RG.2.2.19091.84004 [2] Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores. DOI: 10.14778/3415478.3415560 (2020) [3] Armbrust, Michael and Ghodsi, Ali and Xin, Reynold and Zaharia, Matei (2021). Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics. [4] Gulati, Anmol and Qin, James and Chiu, Chung-Cheng and Parmar, Niki and Zhang, Yu and Yu, Jiahui and Han, Wei and Wang, Shibo and Zhang, Zhengdong and Wu, Yonghui and others (2020). Conformer: Convolution-augmented Transformer for Speech Recognition. URL: https://arxiv.org/abs/2005.08100 [5] Dao, Mai Hoang and Truong, Thinh Hung and Nguyen, Dat Quoc (2021). Intent Detection and Slot Filling for Vietnamese. URL: https://arxiv.org/abs/2104.02021