SlideShare a Scribd company logo
OVERVIEW
BIGDATA
beCloudReady
Big Data
■ Generic term used for large amount of Data ( structured + unstructured ) Peta bytes.
– Facebook Data
– Google Data
– NASA Data
■ Not “Really” Big Data
– Most back office Data eg employee record etc
– BankingTransaction Data
– But, same concepts could be applied.
Apache Hadoop
■ Big Data does not mean Hadoop.
■ Apache Hadoop is an open-source software framework used for distributed storage
and processing of dataset of big data using the MapReduce programming model on
commodity hardware.
Components of Hadoop Ecosystem
Hadoop Distributed File System
(HDFS)
Distributed Batch Processing
(Map Reduce)
Resource Management (YARN)
Apache Spark
■ Spark and its RDDs were developed in 2012 in response to limitations in the MapReduce
cluster computing paradigm, which forces a particular linear dataflow structure on
distributed programs: MapReduce programs read input data from disk, map a function
across the data, reduce the results of the map, and store reduction results on disk.
Spark's RDDs function as a working set for distributed programs that offers a
(deliberately) restricted form of distributed.
Spark & Hadoop
■ Spark and Hadoop not necessarily compete with one another, rather complement each
other.
Apache Hadoop Apache Spark
Map reduce &YARN base system Spark & RDD
Mostly for Batch processing Can we used for stream processing
Optimized for cheap hardware RAM heavy operations
Not easy API ( need to be Rockstar Java Dev) Easy API ( Python, R, Scala, Java )
Elasticsearch
NoSQL full text search DataBase
Not ACID compliant like Oracle,
MySQL
Designed for distributed computing
rather than centralized
computation.
SQL vs NoSQL ( broadly )
NoSQL Databases Relational Databases
Designed for performance Designed for integrity
No relational schema Predefined schema
Various storage models Stored as individual records
Better for a lot of reads and writes Better for a lot of search and query
Auto-sharding Sharding is not implicitly supported
Limited query abilities Well-defined, advanced and standardized query
language
Apache Kafka
Apache Kafka
• Publish and subscribe to streams of records.
• Store streams of records in a fault-tolerant way.
• Process streams of records as they occur.
• Eg: LinkedIn Notification
Data Engineering
Acquiring
• Data Gathering
• Sampling
• Data Ingestion
Data
Preparation
• DataWrangling
• Data Cleaning
• Transformation
Data analysis
• StatisticalAnalysis
• Data Modeling
Data Engineering – Legacy
Acquiring
SQL insert
statements
CSV file
dumping
Data
Preparation
ETLTools
IBM - Data
Stage
Talend
Analytics
Excel based
analytics
SAS bases
analytics
IBMWatson
Data Engineering – Big Data Stack
Data
Acquisition:
• Data Ingestion
• Sqoop – Batch
Ingestion
• Kafka – Stream
Ingestion
Data Prep
• Trifacta Data
Wrangling
• Apache Spark
Data
Analytics
• Data Robot
• SAS + R based
Analytics
• IBM -Watson

More Related Content

What's hot

Big Data Architecture For enterprise
Big Data Architecture For enterpriseBig Data Architecture For enterprise
Big Data Architecture For enterpriseWei Zhang
 
Performance of Spark vs MapReduce
Performance of Spark vs MapReducePerformance of Spark vs MapReduce
Performance of Spark vs MapReduceEdureka!
 
Hadoop vs Spark | Which One to Choose? | Hadoop Training | Spark Training | E...
Hadoop vs Spark | Which One to Choose? | Hadoop Training | Spark Training | E...Hadoop vs Spark | Which One to Choose? | Hadoop Training | Spark Training | E...
Hadoop vs Spark | Which One to Choose? | Hadoop Training | Spark Training | E...Edureka!
 
An introduction to Apache Hadoop Hive
An introduction to Apache Hadoop HiveAn introduction to Apache Hadoop Hive
An introduction to Apache Hadoop HiveMike Frampton
 
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dan Lynn
 
Big data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irBig data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irdatastack
 
The Hadoop Path by Subash DSouza of Archangel Technology Consultants, LLC.
The Hadoop Path by Subash DSouza of Archangel Technology Consultants, LLC.The Hadoop Path by Subash DSouza of Archangel Technology Consultants, LLC.
The Hadoop Path by Subash DSouza of Archangel Technology Consultants, LLC.Data Con LA
 
Doug Cutting on the State of the Hadoop Ecosystem
Doug Cutting on the State of the Hadoop EcosystemDoug Cutting on the State of the Hadoop Ecosystem
Doug Cutting on the State of the Hadoop EcosystemCloudera, Inc.
 
Hadoop - A big data initiative
Hadoop - A big data initiativeHadoop - A big data initiative
Hadoop - A big data initiativeMansi Mehra
 

What's hot (20)

Hive and querying data
Hive and querying dataHive and querying data
Hive and querying data
 
Hadoop vs Apache Spark
Hadoop vs Apache SparkHadoop vs Apache Spark
Hadoop vs Apache Spark
 
Big Data Architecture For enterprise
Big Data Architecture For enterpriseBig Data Architecture For enterprise
Big Data Architecture For enterprise
 
Performance of Spark vs MapReduce
Performance of Spark vs MapReducePerformance of Spark vs MapReduce
Performance of Spark vs MapReduce
 
Big Data - Part IV
Big Data - Part IVBig Data - Part IV
Big Data - Part IV
 
Big Data - Part III
Big Data - Part IIIBig Data - Part III
Big Data - Part III
 
Hadoop vs Spark | Which One to Choose? | Hadoop Training | Spark Training | E...
Hadoop vs Spark | Which One to Choose? | Hadoop Training | Spark Training | E...Hadoop vs Spark | Which One to Choose? | Hadoop Training | Spark Training | E...
Hadoop vs Spark | Which One to Choose? | Hadoop Training | Spark Training | E...
 
An introduction to Apache Hadoop Hive
An introduction to Apache Hadoop HiveAn introduction to Apache Hadoop Hive
An introduction to Apache Hadoop Hive
 
Spark + Cassandra
Spark + CassandraSpark + Cassandra
Spark + Cassandra
 
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
 
Big data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irBig data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.ir
 
The Hadoop Path by Subash DSouza of Archangel Technology Consultants, LLC.
The Hadoop Path by Subash DSouza of Archangel Technology Consultants, LLC.The Hadoop Path by Subash DSouza of Archangel Technology Consultants, LLC.
The Hadoop Path by Subash DSouza of Archangel Technology Consultants, LLC.
 
Big Data - Part II
Big Data - Part IIBig Data - Part II
Big Data - Part II
 
Big data & Hadoop
Big data & HadoopBig data & Hadoop
Big data & Hadoop
 
Spark Core
Spark CoreSpark Core
Spark Core
 
Doug Cutting on the State of the Hadoop Ecosystem
Doug Cutting on the State of the Hadoop EcosystemDoug Cutting on the State of the Hadoop Ecosystem
Doug Cutting on the State of the Hadoop Ecosystem
 
Hadoop - A big data initiative
Hadoop - A big data initiativeHadoop - A big data initiative
Hadoop - A big data initiative
 
Big Data - Part I
Big Data - Part IBig Data - Part I
Big Data - Part I
 
Hadoop An Introduction
Hadoop An IntroductionHadoop An Introduction
Hadoop An Introduction
 
Apache Spark Briefing
Apache Spark BriefingApache Spark Briefing
Apache Spark Briefing
 

Similar to Big data overview

Unit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxUnit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxRahul Borate
 
Geek Night - Functional Data Processing using Spark and Scala
Geek Night - Functional Data Processing using Spark and ScalaGeek Night - Functional Data Processing using Spark and Scala
Geek Night - Functional Data Processing using Spark and ScalaAtif Akhtar
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overviewDataArt
 
Comparison - RDBMS vs Hadoop vs Apache
Comparison - RDBMS vs Hadoop vs ApacheComparison - RDBMS vs Hadoop vs Apache
Comparison - RDBMS vs Hadoop vs ApacheSandeepTaksande
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekVenkata Naga Ravi
 
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...Chetan Khatri
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the SurfaceJosi Aranda
 
Big_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_SessionBig_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_SessionRUHULAMINHAZARIKA
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerMark Kromer
 
Hadoop vs spark
Hadoop vs sparkHadoop vs spark
Hadoop vs sparkamarkayam
 
New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015Robbie Strickland
 
Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014mahchiev
 

Similar to Big data overview (20)

Unit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxUnit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptx
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
 
Geek Night - Functional Data Processing using Spark and Scala
Geek Night - Functional Data Processing using Spark and ScalaGeek Night - Functional Data Processing using Spark and Scala
Geek Night - Functional Data Processing using Spark and Scala
 
APACHE SPARK.pptx
APACHE SPARK.pptxAPACHE SPARK.pptx
APACHE SPARK.pptx
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
 
Comparison - RDBMS vs Hadoop vs Apache
Comparison - RDBMS vs Hadoop vs ApacheComparison - RDBMS vs Hadoop vs Apache
Comparison - RDBMS vs Hadoop vs Apache
 
Apache Spark PDF
Apache Spark PDFApache Spark PDF
Apache Spark PDF
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Meetup Oracle Database BCN: 2.1 Data Management Trends
Meetup Oracle Database BCN: 2.1 Data Management TrendsMeetup Oracle Database BCN: 2.1 Data Management Trends
Meetup Oracle Database BCN: 2.1 Data Management Trends
 
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the Surface
 
Big_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_SessionBig_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_Session
 
Apache Spark on HDinsight Training
Apache Spark on HDinsight TrainingApache Spark on HDinsight Training
Apache Spark on HDinsight Training
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL Server
 
Spark
SparkSpark
Spark
 
Hadoop vs spark
Hadoop vs sparkHadoop vs spark
Hadoop vs spark
 
New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015
 
Big data with java
Big data with javaBig data with java
Big data with java
 
Nosql seminar
Nosql seminarNosql seminar
Nosql seminar
 
Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014
 

Recently uploaded

Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单ewymefz
 
Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIAlejandraGmez176757
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单ewymefz
 
How can I successfully sell my pi coins in Philippines?
How can I successfully sell my pi coins in Philippines?How can I successfully sell my pi coins in Philippines?
How can I successfully sell my pi coins in Philippines?DOT TECH
 
Supply chain analytics to combat the effects of Ukraine-Russia-conflict
Supply chain analytics to combat the effects of Ukraine-Russia-conflictSupply chain analytics to combat the effects of Ukraine-Russia-conflict
Supply chain analytics to combat the effects of Ukraine-Russia-conflictJack Cole
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单ocavb
 
Investigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesInvestigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesStarCompliance.io
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单vcaxypu
 
tapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive datatapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive datatheahmadsaood
 
Computer Presentation.pptx ecommerce advantage s
Computer Presentation.pptx ecommerce advantage sComputer Presentation.pptx ecommerce advantage s
Computer Presentation.pptx ecommerce advantage sMAQIB18
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单yhkoc
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单nscud
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单ewymefz
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单ukgaet
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单ewymefz
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单enxupq
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .NABLAS株式会社
 

Recently uploaded (20)

Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
 
Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMI
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
 
How can I successfully sell my pi coins in Philippines?
How can I successfully sell my pi coins in Philippines?How can I successfully sell my pi coins in Philippines?
How can I successfully sell my pi coins in Philippines?
 
Supply chain analytics to combat the effects of Ukraine-Russia-conflict
Supply chain analytics to combat the effects of Ukraine-Russia-conflictSupply chain analytics to combat the effects of Ukraine-Russia-conflict
Supply chain analytics to combat the effects of Ukraine-Russia-conflict
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
 
Investigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesInvestigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_Crimes
 
Slip-and-fall Injuries: Top Workers' Comp Claims
Slip-and-fall Injuries: Top Workers' Comp ClaimsSlip-and-fall Injuries: Top Workers' Comp Claims
Slip-and-fall Injuries: Top Workers' Comp Claims
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
 
tapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive datatapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive data
 
Computer Presentation.pptx ecommerce advantage s
Computer Presentation.pptx ecommerce advantage sComputer Presentation.pptx ecommerce advantage s
Computer Presentation.pptx ecommerce advantage s
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
 

Big data overview

  • 2. Big Data ■ Generic term used for large amount of Data ( structured + unstructured ) Peta bytes. – Facebook Data – Google Data – NASA Data ■ Not “Really” Big Data – Most back office Data eg employee record etc – BankingTransaction Data – But, same concepts could be applied.
  • 3. Apache Hadoop ■ Big Data does not mean Hadoop. ■ Apache Hadoop is an open-source software framework used for distributed storage and processing of dataset of big data using the MapReduce programming model on commodity hardware.
  • 4. Components of Hadoop Ecosystem Hadoop Distributed File System (HDFS) Distributed Batch Processing (Map Reduce) Resource Management (YARN)
  • 5.
  • 6. Apache Spark ■ Spark and its RDDs were developed in 2012 in response to limitations in the MapReduce cluster computing paradigm, which forces a particular linear dataflow structure on distributed programs: MapReduce programs read input data from disk, map a function across the data, reduce the results of the map, and store reduction results on disk. Spark's RDDs function as a working set for distributed programs that offers a (deliberately) restricted form of distributed.
  • 7. Spark & Hadoop ■ Spark and Hadoop not necessarily compete with one another, rather complement each other. Apache Hadoop Apache Spark Map reduce &YARN base system Spark & RDD Mostly for Batch processing Can we used for stream processing Optimized for cheap hardware RAM heavy operations Not easy API ( need to be Rockstar Java Dev) Easy API ( Python, R, Scala, Java )
  • 8. Elasticsearch NoSQL full text search DataBase Not ACID compliant like Oracle, MySQL Designed for distributed computing rather than centralized computation.
  • 9. SQL vs NoSQL ( broadly ) NoSQL Databases Relational Databases Designed for performance Designed for integrity No relational schema Predefined schema Various storage models Stored as individual records Better for a lot of reads and writes Better for a lot of search and query Auto-sharding Sharding is not implicitly supported Limited query abilities Well-defined, advanced and standardized query language
  • 11. Apache Kafka • Publish and subscribe to streams of records. • Store streams of records in a fault-tolerant way. • Process streams of records as they occur. • Eg: LinkedIn Notification
  • 12. Data Engineering Acquiring • Data Gathering • Sampling • Data Ingestion Data Preparation • DataWrangling • Data Cleaning • Transformation Data analysis • StatisticalAnalysis • Data Modeling
  • 13. Data Engineering – Legacy Acquiring SQL insert statements CSV file dumping Data Preparation ETLTools IBM - Data Stage Talend Analytics Excel based analytics SAS bases analytics IBMWatson
  • 14. Data Engineering – Big Data Stack Data Acquisition: • Data Ingestion • Sqoop – Batch Ingestion • Kafka – Stream Ingestion Data Prep • Trifacta Data Wrangling • Apache Spark Data Analytics • Data Robot • SAS + R based Analytics • IBM -Watson

Editor's Notes

  1. Components of Hadoop Architecture Core components : The Distributed Storage Framework – Hadoop Distributed File System Distributed Processing Framework - MapReduce and YARN Other supporting frameworks: Integration Frameworks – SQOOP , FLUME Management Frameworks – Ambari, ZooKeeper, Oozie Development Frameworks – Pig, Hive, Hbase, HCatalog Business Intelligence and Reporting Frameworks – Third Party Tools and Applications like SAS, Microstrategy, Splunk, Microsoft BI stack etc. Sources: Intel Distribution for Apache Hadoop: https://communities.intel.com/community/itpeernetwork/datastack/blog/2013/09/24/securing-big-data-for-the-enterprise-project-rhino-and-the-intel-distribution-for-apache-hadoop-idh Big Data Open Source Technology Stack: http://datakulfi.wordpress.com/2013/03/27/big-data-open-source-technology-landscape/ Apache Hadoop Official: http://hadoop.apache.org/ Where this all fits: http://blog.syncsort.com/wp-content/uploads/2013/11/strata_diagram.png