SlideShare a Scribd company logo
1 of 52
Building a Modern Big Data & Advanced
Analytics Pipeline
(Ideas for building UDAP)
About Us
• Emerging technology firm focused on helping enterprises build breakthrough
software solutions
• Building software solutions powered by disruptive enterprise software trends
-Machine learning and data science
-Cyber-security
-Enterprise IOT
-Powered by Cloud and Mobile
• Bringing innovation from startups and academic institutions to the enterprise
• Award winning agencies: Inc 500, American Business Awards, International
Business Awards
• The principles of big data and advanced analytics pipelines
• Some inspiration
• Capabilities
• Building a big data and advanced analytics pipeline
Agenda
The principles of an enterprise big data
infrastructure
Data Needs
Vision
Solutions
Data Science
Data Infrastructure
Data Access
There are only a few
technology choices….
Data Needs
Some inspiration….
Netflix
Data Access
Data Fetching:
Falcor(https://github.com/Ne
tflix/falcor )
Data Streaming: Apache Kafka
(http://kafka.apache.org/ )
Federated Job Execution
Engine:
Genie(https://github.com/Net
flix/genie )
Data Infrastructure
Data Lakes: Apache Hadoop
(http://hadoop.apache.org/ )
Data Compute: Apache Spark
SQL Querying: Presto
(https://prestodb.io/ )
Data Discovery : Metacat
Data Science
Multidimensional analysis:
Druid (http://druid.io/ )
Data Visualization: Sting
Machine learning: Scikit-
learn( http://scikit-
learn.org/stable/ )
Tools & Solutions
Netflix big data portal
Hadoop Search:
Inviso(https://github.com/Net
flix/inviso )
Workflow visualization
(https://github.com/Netflix/Li
pstick )
Netflix Big Data Portal
Netflix Lipstick
Netflix Data Pipeline
Spotify
Data Access
Data Fetching:
GraphQL(https://facebook.git
hub.io/react/blog/2015/05/0
1/graphql-introduction.html )
Data Streaming: Apache Kafka
(http://kafka.apache.org/ )
Data Infrastructure
Data Lakes: Apache Hadoop
(http://hadoop.apache.org/ )
Data Compute: Apache Spark
SQL Aggregation: Apache
Crunch(https://crunch.apache
.org/ )
Fast Data Access: Apache
Cassandra(http://cassandra.a
pache.org/ )
Workflow Manager
:Luigi(https://github.com/spo
tify/luigi )
Data Transformation: Apache
Falcon(http://hortonworks.co
m/hadoop/falcon/ )
Data Science
Data Visualization: Sting
Machine learning: Spark
MLib( http://scikit-
learn.org/stable/ )
Data Discovery: Raynor
Tools & Solutions
Hadoop Search:
Inviso(https://github.com/Net
flix/inviso )
Raynor
LinkedIn
Data Access
Data Streaming: Apache Kafka
(http://kafka.apache.org/ )
Data Fetching:
GraphQL(https://facebook.git
hub.io/react/blog/2015/05/0
1/graphql-introduction.html )
Data Infrastructure
Data Lakes: Apache Hadoop
(http://hadoop.apache.org/ )
Data Compute: Apache
Spark(http://www.project-
voldemort.com/voldemort/ )
Fast Data Access:
Voldemort(http://cassandra.a
pache.org/ )
Stream Analytics : Apache
Samza(http://samza.apache.o
rg/ )
Real Time Search : Zoie
(http://javasoze.github.io/zoi
e/ )
Data Science
Multidimensional analysis:
Druid (http://druid.io/ )
Data Visualization: Sting
Machine learning: Scikit-
learn( http://scikit-
learn.org/stable/ )
Data Discovery: Raynor
Tools & Solutions
Hadoop Search:
Inviso(https://github.com/Net
flix/inviso )
LinkedIn Stream Data Processing
LinkedIn Rewinder
Goldman Sachs
Data Access
Data Fetching:
GraphQL(https://facebook.git
hub.io/react/blog/2015/05/0
1/graphql-introduction.html )
Data Streaming: Apache Kafka
(http://kafka.apache.org/ )
Data Infrastructure
Data Lakes: Apache
Hadoop/HBase
(http://hadoop.apache.org/ )
Data Compute: Apache Spark
Data Transformation: Apache
Pig(http://hortonworks.com/
hadoop/falcon/ )
Stream Analytics: Apache
Storm
(http://storm.apache.org/ )
Data Science
Multidimensional analysis:
Druid (http://druid.io/ )
Data Visualization: Sting
Machine learning: Spark
MLib( http://scikit-
learn.org/stable/ )
Data Discovery: Custom data
catalog
Tools & Solutions
Secure data exchange:
Symphony
(http://www.goldmansachs.co
m/what-we-
do/engineering/see-our-
work/inside-symphony.html )
Goldman Sachs Data Exchange Architecture
Capabilities of a big data pipeline
Data Access….
• Provide the foundation for data collection and data ingestion methods at an enterprise
scale
• Support different data collection models in a consistent architecture
• Incorporate and remove data sources without impacting the overall infrastructure
Goals
• On-demand data access
• Batch data access
• Stream data access
• Data transformation
Foundational Capabilities
• Enable standard data access
protocols for line of business
systems
• Empower client applications
with data querying capabilities
• Provide data access
infrastructure building blocks
such as caching across business
data sources
On-Demand Data Access
Best Practices Interesting Technologies
• GraphQL(https://facebook.github.io/
react/blog/2015/05/01/graphql-
introduction.html )
• Odata(http://odata.org )
• Falcor
(http://netflix.github.io/falcor/ )
• Enable agile ETL models
• Support federated job
processing
Batch Data Access
Best Practices Interesting Technologies
• Genie(https://github.com/Netfl
ix/genie )
• Luigi(https://github.com/spoti
fy/luigi )
• Apache
Pig(https://pig.apache.org/ )
• Enable streaming data from
line of business systems
• Provide the infrastructure to
incorporate new data sources
such as sensors, web streams
etc
• Provide a consistent model for
data integration between line of
business systems
Stream Data Access
Best Practices Interesting Technologies
• Apache
Kafka(http://kafka.apache.org/
)
• RabbitMQ(https://www.rabbit
mq.com/ )
• ZeroMQ(http://zeromq.org/ )
• Many others….
• Enable federated aggregation of
disparate data sources
• Focus on small data sources
• Enable standard protocols to
access the federated data
sources
Data Virtualization
Best Practices Interesting Technologies
• Denodo(http://www.denodo.co
m/en )
• JBoss Data
Virtualization(http://www.jbos
s.org/products/datavirt/overvie
w/ )
Data Infrastructure….
• Store heterogeneous business data at scale
• Provide consistent models to aggregate and compose data sources from different data
sources
• Manage and curate business data sources
• Discover and consume data available in your organization
Goals
• Data lakes
• Data quality
• Data discovery
• Data transformation
Foundational Capabilities
• Focus on complementing and
expanding our data warehouse
capabilities
• Optimize the data lake to
incorporate heterogeneous data
sources
• Support multiple data ingestion
models
• Consider a hybrid cloud
strategy (pilot vs. production )
Data Lakes
Best Practices Interesting Technologies
• Hadoop(http://hadoop.apache.org/ )
• Hive(https://hive.apache.org/ )
• Hbase(https://hbase.apache.org/ )
• Spark(http://spark.apache.org/ )
• Greenplum(http://greenplum.org/ )
• Many others….
• Avoid traditional data quality
methodologies
• Leverage machine learning to
streamline data quality rules
• Leverage modern data quality
platforms
• Crowsourced vs. centralized
data quality models
Data Quality
Best Practices Interesting Technologies
• Trifacta(http://trifacta.com )
• Tamr(http://tamr.com )
• Alation(https://alation.com/ )
• Paxata(http://www.paxata.com/ )
• Master management solutions
don’t work with modern data
sources
• Promote crow-sourced vs.
centralized data publishing
• Focus on user experience
• Consider build vs. buy options
Data Discovery
Best Practices Interesting Technologies
• Tamr(http://tamr.com )
• Custom solutions…
• Spotify Raynor
• Netflix big data portal
• Enable programmable ETLs
• Support data transformations
for both batch and real time
data sources
• Agility over robustness
Data Transformations
Best Practices Interesting Technologies
• Apache
Pig(https://pig.apache.org/ )
• Streamsets(https://streamsets.
com/ )
• Apache Spark
(http://spark.apache.org/ )
Data Science….
• Discover insights of business data sources
• Integrate machine learning capabilities as part of the enterprise data pipeline
• Provide the foundation for predictive analytic capabilities across the enterprise
• Enable programmatic execution of machine learning models
Goals
• Data visualization & self-service BI
• Predictive analytics
• Stream analytics
• Proactive analytics
Foundational Capabilities
• Access business data sources
from mainstream data
visualization tools like Excel ,
Tableau, QlickView, Datameer,
etc.
• Publish data visualizations so
that they can be discovered by
other information workers
• Embed visualization as part of
existing line of business
solutions
Data Visualization and Self-Service BI
Best Practices Interesting Technologies
• Tableau(http://www.tableau.com/ )
• PowerBI(https://powerbi.microsoft.co
m/en-us/ )
• Datameer(http://www.datameer.com/
)
• QlikView(http://www.qlik.com/ )
• Visualization libraries
• ….
• Implement the tools and
frameworks to author machine
learning models using business
data sources
• Expose predictive models via
programmable APIs
• Provide the infrastructure to
test, train and evaluate machine
learning models
Predictive Analytics
Best Practices Interesting Technologies
• Spark
Mlib(http://spark.apache.org/docs/la
test/mllib-guide.html )
• Scikit-Learn(http://scikit-learn.org/ )
• Dato(https://dato.com/ )
• H20.ai(http://www.h2o.ai/ )
• ….
• Aggregate data real time from
diverse data sources
• Model static queries over
dynamic streams of data
• Create simulations and replays
of real data streams
Stream Analytics
Best Practices Interesting Technologies
• Apache
Storm(http://storm.apache.org/ )
• Spark Streaming
(http://spark.apache.org/streaming/
)
• Apache
Samza(http://samza.apache.org/ )
• ….
• Automate actions based on the
output of predictive models
• Use programmatic models to
script proactive analytics
business rules
• Continuously test and validate
proactive rules
Proactive Analytics
Best Practices Interesting Technologies
• Spark
Mlib(http://spark.apache.org/d
ocs/latest/mllib-guide.html )
• Scikit-Learn(http://scikit-
learn.org/ )
Solutions….
• Leverage a consistent data pipeline as part of all solutions
• Empower different teams to contribute to different aspects of the big data pipeline
• Keep track of key metrics about the big data pipeline such as time to deliver solutions,
data volume over time, data quality metrics, etc
Enterprise Data Solutions
• Data discovery
• Data quality
• Data testing tools
• …
Some Examples
• Mobile analytics
• Embedded analytics capabilities (ex: Salesforce Wave, Workday)
• Aggregation with external sources
• Video & image analytics
• Deep learning
• ….
Other Interesting Capabilities
Building a big data and advanced analytics
pipeline
Infrastructure-Driven
Data Storage Data Aggregation Data Transformation
Data Discovery Others…
Stream Analytics Predictive Analytics
Proactive Analytics Others…
Real time data access Batch data access Stream data access
Solution
Solution
Solution
Data
Access
Data
Infrastructure
Data
Science
Solutions
Domain-Driven
Data Storage Data Aggregation Data Transformation
Data Discovery Others…
Stream Analytics Predictive Analytics
Proactive Analytics Others…
Real time data access Batch data access Stream data access
Solution
Solution
Solution
Data
Access
Data
Infrastructure
Data
Science
Solutions
• Lead by the architecture team
• Military discipline
• Commitment from business
stakeholders
Infrastructure-Drives vs. Domain-Driven Approaches
Infrastructure-Driven Domain-Driven
• Federated data teams
• Rapid releases
• Pervasive communications
• Establish a vision across all levels of the data pipeline
• You can’t buy everything…Is likely you will build custom data infrastructure building
blocks
• Deliver infrastructure and functional capabilities incrementally
• Establish a data innovation group responsible for piloting infrastructure capabilities
ahead of production schedules
• Encourage adoption even in early stages
• Iterate
Some General Rules
Summary
• Big data and advanced analytics pipelines are based on 4 fundamental elements: data access,
data infrastructure, data science, data solutions….
• A lot of inspiration can be learned from the big data solutions built by lead internet vendors
• Establish a common vision and mission
• Start small….iterate….
Thanks
http://Tellago.com
Info@Tellago.com

More Related Content

What's hot

Big Data & Hadoop Introduction
Big Data & Hadoop IntroductionBig Data & Hadoop Introduction
Big Data & Hadoop IntroductionJayant Mukherjee
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkRahul Jain
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation HadoopVarun Narang
 
Introduction to HiveQL
Introduction to HiveQLIntroduction to HiveQL
Introduction to HiveQLkristinferrier
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta LakeDatabricks
 
Apache spark 소개 및 실습
Apache spark 소개 및 실습Apache spark 소개 및 실습
Apache spark 소개 및 실습동현 강
 
Data Lake Overview
Data Lake OverviewData Lake Overview
Data Lake OverviewJames Serra
 
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...DataScienceConferenc1
 
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...Simplilearn
 
Moving to Databricks & Delta
Moving to Databricks & DeltaMoving to Databricks & Delta
Moving to Databricks & DeltaDatabricks
 
Hadoop Hive Tutorial | Hive Fundamentals | Hive Architecture
Hadoop Hive Tutorial | Hive Fundamentals | Hive ArchitectureHadoop Hive Tutorial | Hive Fundamentals | Hive Architecture
Hadoop Hive Tutorial | Hive Fundamentals | Hive ArchitectureSkillspeed
 
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...Yael Garten
 
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...Simplilearn
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark InternalsPietro Michiardi
 
Big Data Architecture
Big Data ArchitectureBig Data Architecture
Big Data ArchitectureGuido Schmutz
 
ODSC May 2019 - The DataOps Manifesto
ODSC May 2019 - The DataOps ManifestoODSC May 2019 - The DataOps Manifesto
ODSC May 2019 - The DataOps ManifestoDataKitchen
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 

What's hot (20)

Big Data
Big DataBig Data
Big Data
 
Big Data & Hadoop Introduction
Big Data & Hadoop IntroductionBig Data & Hadoop Introduction
Big Data & Hadoop Introduction
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Hadoop Tutorial For Beginners
Hadoop Tutorial For BeginnersHadoop Tutorial For Beginners
Hadoop Tutorial For Beginners
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 
Introduction to HiveQL
Introduction to HiveQLIntroduction to HiveQL
Introduction to HiveQL
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
 
Apache spark 소개 및 실습
Apache spark 소개 및 실습Apache spark 소개 및 실습
Apache spark 소개 및 실습
 
Data Lake Overview
Data Lake OverviewData Lake Overview
Data Lake Overview
 
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
 
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
 
Moving to Databricks & Delta
Moving to Databricks & DeltaMoving to Databricks & Delta
Moving to Databricks & Delta
 
Hadoop Hive Tutorial | Hive Fundamentals | Hive Architecture
Hadoop Hive Tutorial | Hive Fundamentals | Hive ArchitectureHadoop Hive Tutorial | Hive Fundamentals | Hive Architecture
Hadoop Hive Tutorial | Hive Fundamentals | Hive Architecture
 
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
 
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark Internals
 
Big Data Architecture
Big Data ArchitectureBig Data Architecture
Big Data Architecture
 
ODSC May 2019 - The DataOps Manifesto
ODSC May 2019 - The DataOps ManifestoODSC May 2019 - The DataOps Manifesto
ODSC May 2019 - The DataOps Manifesto
 
Lakehouse in Azure
Lakehouse in AzureLakehouse in Azure
Lakehouse in Azure
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 

Viewers also liked

Processing Twitter Events in Real-Time with Oracle Event Processing (OEP) 12c
Processing Twitter Events in Real-Time with Oracle Event Processing (OEP) 12cProcessing Twitter Events in Real-Time with Oracle Event Processing (OEP) 12c
Processing Twitter Events in Real-Time with Oracle Event Processing (OEP) 12cGuido Schmutz
 
What Is Visualization?
What Is Visualization?What Is Visualization?
What Is Visualization?OneSpring LLC
 
An Introduction to Evaluation in Medical Visualization
An Introduction to Evaluation in Medical VisualizationAn Introduction to Evaluation in Medical Visualization
An Introduction to Evaluation in Medical VisualizationNoeska Smit
 
Information Visualization for Medical Informatics
Information Visualization for Medical Informatics Information Visualization for Medical Informatics
Information Visualization for Medical Informatics University of Maryland
 
Info vis 4-22-2013-dc-vis-meetup-shneiderman
Info vis 4-22-2013-dc-vis-meetup-shneidermanInfo vis 4-22-2013-dc-vis-meetup-shneiderman
Info vis 4-22-2013-dc-vis-meetup-shneidermanUniversity of Maryland
 
Theius: A Streaming Visualization Suite for Hadoop Clusters
Theius: A Streaming Visualization Suite for Hadoop ClustersTheius: A Streaming Visualization Suite for Hadoop Clusters
Theius: A Streaming Visualization Suite for Hadoop Clustersjtedesco5
 
Real-Time Analytics and Visualization of Streaming Big Data with JReport & Sc...
Real-Time Analytics and Visualization of Streaming Big Data with JReport & Sc...Real-Time Analytics and Visualization of Streaming Big Data with JReport & Sc...
Real-Time Analytics and Visualization of Streaming Big Data with JReport & Sc...Mia Yuan Cao
 
Text and text stream mining tutorial
Text and text stream mining tutorialText and text stream mining tutorial
Text and text stream mining tutorialmgrcar
 
In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkVenkata Naga Ravi
 
JT@UCSB - On-Demand Data Streaming from Sensor Nodes and A quick overview of ...
JT@UCSB - On-Demand Data Streaming from Sensor Nodes and A quick overview of ...JT@UCSB - On-Demand Data Streaming from Sensor Nodes and A quick overview of ...
JT@UCSB - On-Demand Data Streaming from Sensor Nodes and A quick overview of ...Jonas Traub
 
Web 2 0 Projects Elementary
Web 2 0 Projects ElementaryWeb 2 0 Projects Elementary
Web 2 0 Projects ElementaryCinci0987
 
Towards Utilizing GPUs in Information Visualization
Towards Utilizing GPUs in Information VisualizationTowards Utilizing GPUs in Information Visualization
Towards Utilizing GPUs in Information VisualizationNiklas Elmqvist
 
Presentation Brucon - Anubisnetworks and PTCoresec
Presentation Brucon - Anubisnetworks and PTCoresecPresentation Brucon - Anubisnetworks and PTCoresec
Presentation Brucon - Anubisnetworks and PTCoresecTiago Henriques
 
Stream Processing with Kafka in Uber, Danny Yuan
Stream Processing with Kafka in Uber, Danny Yuan Stream Processing with Kafka in Uber, Danny Yuan
Stream Processing with Kafka in Uber, Danny Yuan confluent
 

Viewers also liked (14)

Processing Twitter Events in Real-Time with Oracle Event Processing (OEP) 12c
Processing Twitter Events in Real-Time with Oracle Event Processing (OEP) 12cProcessing Twitter Events in Real-Time with Oracle Event Processing (OEP) 12c
Processing Twitter Events in Real-Time with Oracle Event Processing (OEP) 12c
 
What Is Visualization?
What Is Visualization?What Is Visualization?
What Is Visualization?
 
An Introduction to Evaluation in Medical Visualization
An Introduction to Evaluation in Medical VisualizationAn Introduction to Evaluation in Medical Visualization
An Introduction to Evaluation in Medical Visualization
 
Information Visualization for Medical Informatics
Information Visualization for Medical Informatics Information Visualization for Medical Informatics
Information Visualization for Medical Informatics
 
Info vis 4-22-2013-dc-vis-meetup-shneiderman
Info vis 4-22-2013-dc-vis-meetup-shneidermanInfo vis 4-22-2013-dc-vis-meetup-shneiderman
Info vis 4-22-2013-dc-vis-meetup-shneiderman
 
Theius: A Streaming Visualization Suite for Hadoop Clusters
Theius: A Streaming Visualization Suite for Hadoop ClustersTheius: A Streaming Visualization Suite for Hadoop Clusters
Theius: A Streaming Visualization Suite for Hadoop Clusters
 
Real-Time Analytics and Visualization of Streaming Big Data with JReport & Sc...
Real-Time Analytics and Visualization of Streaming Big Data with JReport & Sc...Real-Time Analytics and Visualization of Streaming Big Data with JReport & Sc...
Real-Time Analytics and Visualization of Streaming Big Data with JReport & Sc...
 
Text and text stream mining tutorial
Text and text stream mining tutorialText and text stream mining tutorial
Text and text stream mining tutorial
 
In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache Spark
 
JT@UCSB - On-Demand Data Streaming from Sensor Nodes and A quick overview of ...
JT@UCSB - On-Demand Data Streaming from Sensor Nodes and A quick overview of ...JT@UCSB - On-Demand Data Streaming from Sensor Nodes and A quick overview of ...
JT@UCSB - On-Demand Data Streaming from Sensor Nodes and A quick overview of ...
 
Web 2 0 Projects Elementary
Web 2 0 Projects ElementaryWeb 2 0 Projects Elementary
Web 2 0 Projects Elementary
 
Towards Utilizing GPUs in Information Visualization
Towards Utilizing GPUs in Information VisualizationTowards Utilizing GPUs in Information Visualization
Towards Utilizing GPUs in Information Visualization
 
Presentation Brucon - Anubisnetworks and PTCoresec
Presentation Brucon - Anubisnetworks and PTCoresecPresentation Brucon - Anubisnetworks and PTCoresec
Presentation Brucon - Anubisnetworks and PTCoresec
 
Stream Processing with Kafka in Uber, Danny Yuan
Stream Processing with Kafka in Uber, Danny Yuan Stream Processing with Kafka in Uber, Danny Yuan
Stream Processing with Kafka in Uber, Danny Yuan
 

Similar to Building a Big Data Pipeline

10 Big Data Technologies you Didn't Know About
10 Big Data Technologies you Didn't Know About 10 Big Data Technologies you Didn't Know About
10 Big Data Technologies you Didn't Know About Jesus Rodriguez
 
Testing Big Data: Automated Testing of Hadoop with QuerySurge
Testing Big Data: Automated  Testing of Hadoop with QuerySurgeTesting Big Data: Automated  Testing of Hadoop with QuerySurge
Testing Big Data: Automated Testing of Hadoop with QuerySurgeRTTS
 
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in ProductionTugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in ProductionCodemotion
 
Intro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWSIntro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWSSri Ambati
 
Lecture1 BIG DATA and Types of data in details
Lecture1 BIG DATA and Types of data in detailsLecture1 BIG DATA and Types of data in details
Lecture1 BIG DATA and Types of data in detailsAbhishekKumarAgrahar2
 
Experfy Online Course - Gain Competitive Advantage Using Microsoft Azure Data...
Experfy Online Course - Gain Competitive Advantage Using Microsoft Azure Data...Experfy Online Course - Gain Competitive Advantage Using Microsoft Azure Data...
Experfy Online Course - Gain Competitive Advantage Using Microsoft Azure Data...Experfy
 
Spsbepoelmanssharepointbigdataclean 150421080105-conversion-gate02
Spsbepoelmanssharepointbigdataclean 150421080105-conversion-gate02Spsbepoelmanssharepointbigdataclean 150421080105-conversion-gate02
Spsbepoelmanssharepointbigdataclean 150421080105-conversion-gate02BIWUG
 
How to build your own Delve: combining machine learning, big data and SharePoint
How to build your own Delve: combining machine learning, big data and SharePointHow to build your own Delve: combining machine learning, big data and SharePoint
How to build your own Delve: combining machine learning, big data and SharePointJoris Poelmans
 
Microsoft Azure BI Solutions in the Cloud
Microsoft Azure BI Solutions in the CloudMicrosoft Azure BI Solutions in the Cloud
Microsoft Azure BI Solutions in the CloudMark Kromer
 
Testing Big Data: Automated ETL Testing of Hadoop
Testing Big Data: Automated ETL Testing of HadoopTesting Big Data: Automated ETL Testing of Hadoop
Testing Big Data: Automated ETL Testing of HadoopRTTS
 
Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data EngineeringDurga Gadiraju
 
The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)
The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)
The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)DataPad Inc.
 
Web Briefing: Unlock the power of Hadoop to enable interactive analytics
Web Briefing: Unlock the power of Hadoop to enable interactive analyticsWeb Briefing: Unlock the power of Hadoop to enable interactive analytics
Web Briefing: Unlock the power of Hadoop to enable interactive analyticsKognitio
 
Developing Enterprise Consciousness: Building Modern Open Data Platforms
Developing Enterprise Consciousness: Building Modern Open Data PlatformsDeveloping Enterprise Consciousness: Building Modern Open Data Platforms
Developing Enterprise Consciousness: Building Modern Open Data PlatformsScyllaDB
 
Streaming Visualization
Streaming VisualizationStreaming Visualization
Streaming VisualizationGuido Schmutz
 
Myth Busters II: BI Tools and Data Virtualization are Interchangeable
Myth Busters II: BI Tools and Data Virtualization are InterchangeableMyth Busters II: BI Tools and Data Virtualization are Interchangeable
Myth Busters II: BI Tools and Data Virtualization are InterchangeableDenodo
 
Modernizing Your Data Warehouse using APS
Modernizing Your Data Warehouse using APSModernizing Your Data Warehouse using APS
Modernizing Your Data Warehouse using APSStéphane Fréchette
 
IARE_BDBA_ PPT_0.pptx
IARE_BDBA_ PPT_0.pptxIARE_BDBA_ PPT_0.pptx
IARE_BDBA_ PPT_0.pptxAIMLSEMINARS
 

Similar to Building a Big Data Pipeline (20)

10 Big Data Technologies you Didn't Know About
10 Big Data Technologies you Didn't Know About 10 Big Data Technologies you Didn't Know About
10 Big Data Technologies you Didn't Know About
 
Testing Big Data: Automated Testing of Hadoop with QuerySurge
Testing Big Data: Automated  Testing of Hadoop with QuerySurgeTesting Big Data: Automated  Testing of Hadoop with QuerySurge
Testing Big Data: Automated Testing of Hadoop with QuerySurge
 
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in ProductionTugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
 
Intro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWSIntro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWS
 
Lecture1 BIG DATA and Types of data in details
Lecture1 BIG DATA and Types of data in detailsLecture1 BIG DATA and Types of data in details
Lecture1 BIG DATA and Types of data in details
 
Experfy Online Course - Gain Competitive Advantage Using Microsoft Azure Data...
Experfy Online Course - Gain Competitive Advantage Using Microsoft Azure Data...Experfy Online Course - Gain Competitive Advantage Using Microsoft Azure Data...
Experfy Online Course - Gain Competitive Advantage Using Microsoft Azure Data...
 
Spsbepoelmanssharepointbigdataclean 150421080105-conversion-gate02
Spsbepoelmanssharepointbigdataclean 150421080105-conversion-gate02Spsbepoelmanssharepointbigdataclean 150421080105-conversion-gate02
Spsbepoelmanssharepointbigdataclean 150421080105-conversion-gate02
 
How to build your own Delve: combining machine learning, big data and SharePoint
How to build your own Delve: combining machine learning, big data and SharePointHow to build your own Delve: combining machine learning, big data and SharePoint
How to build your own Delve: combining machine learning, big data and SharePoint
 
Microsoft Azure BI Solutions in the Cloud
Microsoft Azure BI Solutions in the CloudMicrosoft Azure BI Solutions in the Cloud
Microsoft Azure BI Solutions in the Cloud
 
Testing Big Data: Automated ETL Testing of Hadoop
Testing Big Data: Automated ETL Testing of HadoopTesting Big Data: Automated ETL Testing of Hadoop
Testing Big Data: Automated ETL Testing of Hadoop
 
Big data.ppt
Big data.pptBig data.ppt
Big data.ppt
 
Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data Engineering
 
Lecture1
Lecture1Lecture1
Lecture1
 
The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)
The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)
The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)
 
Web Briefing: Unlock the power of Hadoop to enable interactive analytics
Web Briefing: Unlock the power of Hadoop to enable interactive analyticsWeb Briefing: Unlock the power of Hadoop to enable interactive analytics
Web Briefing: Unlock the power of Hadoop to enable interactive analytics
 
Developing Enterprise Consciousness: Building Modern Open Data Platforms
Developing Enterprise Consciousness: Building Modern Open Data PlatformsDeveloping Enterprise Consciousness: Building Modern Open Data Platforms
Developing Enterprise Consciousness: Building Modern Open Data Platforms
 
Streaming Visualization
Streaming VisualizationStreaming Visualization
Streaming Visualization
 
Myth Busters II: BI Tools and Data Virtualization are Interchangeable
Myth Busters II: BI Tools and Data Virtualization are InterchangeableMyth Busters II: BI Tools and Data Virtualization are Interchangeable
Myth Busters II: BI Tools and Data Virtualization are Interchangeable
 
Modernizing Your Data Warehouse using APS
Modernizing Your Data Warehouse using APSModernizing Your Data Warehouse using APS
Modernizing Your Data Warehouse using APS
 
IARE_BDBA_ PPT_0.pptx
IARE_BDBA_ PPT_0.pptxIARE_BDBA_ PPT_0.pptx
IARE_BDBA_ PPT_0.pptx
 

More from Jesus Rodriguez

The Emergence of DeFi Micro-Primitives
The Emergence of DeFi Micro-PrimitivesThe Emergence of DeFi Micro-Primitives
The Emergence of DeFi Micro-PrimitivesJesus Rodriguez
 
ChatGPT, Foundation Models and Web3.pptx
ChatGPT, Foundation Models and Web3.pptxChatGPT, Foundation Models and Web3.pptx
ChatGPT, Foundation Models and Web3.pptxJesus Rodriguez
 
DeFi Opportunities and Challenges in the Current Crypto Market
DeFi Opportunities and Challenges in the Current Crypto MarketDeFi Opportunities and Challenges in the Current Crypto Market
DeFi Opportunities and Challenges in the Current Crypto MarketJesus Rodriguez
 
The Polygon Blockchain by the Numbers
The Polygon Blockchain by the NumbersThe Polygon Blockchain by the Numbers
The Polygon Blockchain by the NumbersJesus Rodriguez
 
Social Analytics for Cryptocurrencies
Social Analytics for Cryptocurrencies Social Analytics for Cryptocurrencies
Social Analytics for Cryptocurrencies Jesus Rodriguez
 
DeFi Quant Yield-Generating Strategies
DeFi Quant Yield-Generating StrategiesDeFi Quant Yield-Generating Strategies
DeFi Quant Yield-Generating StrategiesJesus Rodriguez
 
High Frequency Trading and DeFi
High Frequency Trading and DeFiHigh Frequency Trading and DeFi
High Frequency Trading and DeFiJesus Rodriguez
 
Simple DeFi Analytics Any Crypto-Investor Should Know About
Simple DeFi Analytics Any Crypto-Investor Should Know About Simple DeFi Analytics Any Crypto-Investor Should Know About
Simple DeFi Analytics Any Crypto-Investor Should Know About Jesus Rodriguez
 
15 Minutes of DeFi Analytics
15 Minutes of DeFi Analytics15 Minutes of DeFi Analytics
15 Minutes of DeFi AnalyticsJesus Rodriguez
 
DeFi Trading Strategies: Opportunities and Challenges
DeFi Trading Strategies: Opportunities and ChallengesDeFi Trading Strategies: Opportunities and Challenges
DeFi Trading Strategies: Opportunities and ChallengesJesus Rodriguez
 
Practical Crypto Asset Predictions rev
Practical Crypto Asset Predictions revPractical Crypto Asset Predictions rev
Practical Crypto Asset Predictions revJesus Rodriguez
 
Better Technical Analysis with Blockchain Indicators
Better Technical Analysis with Blockchain IndicatorsBetter Technical Analysis with Blockchain Indicators
Better Technical Analysis with Blockchain IndicatorsJesus Rodriguez
 
Price Predictions for Cryptocurrencies
Price Predictions for CryptocurrenciesPrice Predictions for Cryptocurrencies
Price Predictions for CryptocurrenciesJesus Rodriguez
 
Fascinating Metrics and Analytics About Cryptocurrencies
Fascinating Metrics and Analytics About CryptocurrenciesFascinating Metrics and Analytics About Cryptocurrencies
Fascinating Metrics and Analytics About CryptocurrenciesJesus Rodriguez
 
Price PRedictions for Crypto-Assets Using Deep Learning
Price PRedictions for Crypto-Assets Using Deep LearningPrice PRedictions for Crypto-Assets Using Deep Learning
Price PRedictions for Crypto-Assets Using Deep LearningJesus Rodriguez
 
Demystifying Centralized Crypto Exchanges using Data Science
Demystifying Centralized Crypto Exchanges using Data ScienceDemystifying Centralized Crypto Exchanges using Data Science
Demystifying Centralized Crypto Exchanges using Data ScienceJesus Rodriguez
 
Crypto assets are a data science heaven rev
Crypto assets are a data science heaven revCrypto assets are a data science heaven rev
Crypto assets are a data science heaven revJesus Rodriguez
 
Implementing Machine Learning in the Real World
Implementing Machine Learning in the Real WorldImplementing Machine Learning in the Real World
Implementing Machine Learning in the Real WorldJesus Rodriguez
 

More from Jesus Rodriguez (20)

The Emergence of DeFi Micro-Primitives
The Emergence of DeFi Micro-PrimitivesThe Emergence of DeFi Micro-Primitives
The Emergence of DeFi Micro-Primitives
 
ChatGPT, Foundation Models and Web3.pptx
ChatGPT, Foundation Models and Web3.pptxChatGPT, Foundation Models and Web3.pptx
ChatGPT, Foundation Models and Web3.pptx
 
DeFi Opportunities and Challenges in the Current Crypto Market
DeFi Opportunities and Challenges in the Current Crypto MarketDeFi Opportunities and Challenges in the Current Crypto Market
DeFi Opportunities and Challenges in the Current Crypto Market
 
MEV Deep Dive .pptx
MEV Deep Dive .pptxMEV Deep Dive .pptx
MEV Deep Dive .pptx
 
Quant in Crypto Land
Quant in Crypto LandQuant in Crypto Land
Quant in Crypto Land
 
The Polygon Blockchain by the Numbers
The Polygon Blockchain by the NumbersThe Polygon Blockchain by the Numbers
The Polygon Blockchain by the Numbers
 
Social Analytics for Cryptocurrencies
Social Analytics for Cryptocurrencies Social Analytics for Cryptocurrencies
Social Analytics for Cryptocurrencies
 
DeFi Quant Yield-Generating Strategies
DeFi Quant Yield-Generating StrategiesDeFi Quant Yield-Generating Strategies
DeFi Quant Yield-Generating Strategies
 
High Frequency Trading and DeFi
High Frequency Trading and DeFiHigh Frequency Trading and DeFi
High Frequency Trading and DeFi
 
Simple DeFi Analytics Any Crypto-Investor Should Know About
Simple DeFi Analytics Any Crypto-Investor Should Know About Simple DeFi Analytics Any Crypto-Investor Should Know About
Simple DeFi Analytics Any Crypto-Investor Should Know About
 
15 Minutes of DeFi Analytics
15 Minutes of DeFi Analytics15 Minutes of DeFi Analytics
15 Minutes of DeFi Analytics
 
DeFi Trading Strategies: Opportunities and Challenges
DeFi Trading Strategies: Opportunities and ChallengesDeFi Trading Strategies: Opportunities and Challenges
DeFi Trading Strategies: Opportunities and Challenges
 
Practical Crypto Asset Predictions rev
Practical Crypto Asset Predictions revPractical Crypto Asset Predictions rev
Practical Crypto Asset Predictions rev
 
Better Technical Analysis with Blockchain Indicators
Better Technical Analysis with Blockchain IndicatorsBetter Technical Analysis with Blockchain Indicators
Better Technical Analysis with Blockchain Indicators
 
Price Predictions for Cryptocurrencies
Price Predictions for CryptocurrenciesPrice Predictions for Cryptocurrencies
Price Predictions for Cryptocurrencies
 
Fascinating Metrics and Analytics About Cryptocurrencies
Fascinating Metrics and Analytics About CryptocurrenciesFascinating Metrics and Analytics About Cryptocurrencies
Fascinating Metrics and Analytics About Cryptocurrencies
 
Price PRedictions for Crypto-Assets Using Deep Learning
Price PRedictions for Crypto-Assets Using Deep LearningPrice PRedictions for Crypto-Assets Using Deep Learning
Price PRedictions for Crypto-Assets Using Deep Learning
 
Demystifying Centralized Crypto Exchanges using Data Science
Demystifying Centralized Crypto Exchanges using Data ScienceDemystifying Centralized Crypto Exchanges using Data Science
Demystifying Centralized Crypto Exchanges using Data Science
 
Crypto assets are a data science heaven rev
Crypto assets are a data science heaven revCrypto assets are a data science heaven rev
Crypto assets are a data science heaven rev
 
Implementing Machine Learning in the Real World
Implementing Machine Learning in the Real WorldImplementing Machine Learning in the Real World
Implementing Machine Learning in the Real World
 

Recently uploaded

Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesPhilip Schwarz
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...soniya singh
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxTier1 app
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesŁukasz Chruściel
 
What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....kzayra69
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmSujith Sukumaran
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityNeo4j
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEOrtus Solutions, Corp
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanyChristoph Pohl
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideChristina Lin
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaHanief Utama
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...OnePlan Solutions
 
Buds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in NoidaBuds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in Noidabntitsolutionsrishis
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...Technogeeks
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...stazi3110
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEEVICTOR MAESTRE RAMIREZ
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtimeandrehoraa
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Velvetech LLC
 

Recently uploaded (20)

Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New Features
 
What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalm
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered Sustainability
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief Utama
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
 
Buds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in NoidaBuds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in Noida
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
 
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort ServiceHot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...
 

Building a Big Data Pipeline

  • 1. Building a Modern Big Data & Advanced Analytics Pipeline (Ideas for building UDAP)
  • 2. About Us • Emerging technology firm focused on helping enterprises build breakthrough software solutions • Building software solutions powered by disruptive enterprise software trends -Machine learning and data science -Cyber-security -Enterprise IOT -Powered by Cloud and Mobile • Bringing innovation from startups and academic institutions to the enterprise • Award winning agencies: Inc 500, American Business Awards, International Business Awards
  • 3. • The principles of big data and advanced analytics pipelines • Some inspiration • Capabilities • Building a big data and advanced analytics pipeline Agenda
  • 4. The principles of an enterprise big data infrastructure
  • 6. There are only a few technology choices….
  • 9. Netflix Data Access Data Fetching: Falcor(https://github.com/Ne tflix/falcor ) Data Streaming: Apache Kafka (http://kafka.apache.org/ ) Federated Job Execution Engine: Genie(https://github.com/Net flix/genie ) Data Infrastructure Data Lakes: Apache Hadoop (http://hadoop.apache.org/ ) Data Compute: Apache Spark SQL Querying: Presto (https://prestodb.io/ ) Data Discovery : Metacat Data Science Multidimensional analysis: Druid (http://druid.io/ ) Data Visualization: Sting Machine learning: Scikit- learn( http://scikit- learn.org/stable/ ) Tools & Solutions Netflix big data portal Hadoop Search: Inviso(https://github.com/Net flix/inviso ) Workflow visualization (https://github.com/Netflix/Li pstick )
  • 13. Spotify Data Access Data Fetching: GraphQL(https://facebook.git hub.io/react/blog/2015/05/0 1/graphql-introduction.html ) Data Streaming: Apache Kafka (http://kafka.apache.org/ ) Data Infrastructure Data Lakes: Apache Hadoop (http://hadoop.apache.org/ ) Data Compute: Apache Spark SQL Aggregation: Apache Crunch(https://crunch.apache .org/ ) Fast Data Access: Apache Cassandra(http://cassandra.a pache.org/ ) Workflow Manager :Luigi(https://github.com/spo tify/luigi ) Data Transformation: Apache Falcon(http://hortonworks.co m/hadoop/falcon/ ) Data Science Data Visualization: Sting Machine learning: Spark MLib( http://scikit- learn.org/stable/ ) Data Discovery: Raynor Tools & Solutions Hadoop Search: Inviso(https://github.com/Net flix/inviso )
  • 15. LinkedIn Data Access Data Streaming: Apache Kafka (http://kafka.apache.org/ ) Data Fetching: GraphQL(https://facebook.git hub.io/react/blog/2015/05/0 1/graphql-introduction.html ) Data Infrastructure Data Lakes: Apache Hadoop (http://hadoop.apache.org/ ) Data Compute: Apache Spark(http://www.project- voldemort.com/voldemort/ ) Fast Data Access: Voldemort(http://cassandra.a pache.org/ ) Stream Analytics : Apache Samza(http://samza.apache.o rg/ ) Real Time Search : Zoie (http://javasoze.github.io/zoi e/ ) Data Science Multidimensional analysis: Druid (http://druid.io/ ) Data Visualization: Sting Machine learning: Scikit- learn( http://scikit- learn.org/stable/ ) Data Discovery: Raynor Tools & Solutions Hadoop Search: Inviso(https://github.com/Net flix/inviso )
  • 16. LinkedIn Stream Data Processing
  • 18. Goldman Sachs Data Access Data Fetching: GraphQL(https://facebook.git hub.io/react/blog/2015/05/0 1/graphql-introduction.html ) Data Streaming: Apache Kafka (http://kafka.apache.org/ ) Data Infrastructure Data Lakes: Apache Hadoop/HBase (http://hadoop.apache.org/ ) Data Compute: Apache Spark Data Transformation: Apache Pig(http://hortonworks.com/ hadoop/falcon/ ) Stream Analytics: Apache Storm (http://storm.apache.org/ ) Data Science Multidimensional analysis: Druid (http://druid.io/ ) Data Visualization: Sting Machine learning: Spark MLib( http://scikit- learn.org/stable/ ) Data Discovery: Custom data catalog Tools & Solutions Secure data exchange: Symphony (http://www.goldmansachs.co m/what-we- do/engineering/see-our- work/inside-symphony.html )
  • 19. Goldman Sachs Data Exchange Architecture
  • 20. Capabilities of a big data pipeline
  • 22. • Provide the foundation for data collection and data ingestion methods at an enterprise scale • Support different data collection models in a consistent architecture • Incorporate and remove data sources without impacting the overall infrastructure Goals
  • 23. • On-demand data access • Batch data access • Stream data access • Data transformation Foundational Capabilities
  • 24. • Enable standard data access protocols for line of business systems • Empower client applications with data querying capabilities • Provide data access infrastructure building blocks such as caching across business data sources On-Demand Data Access Best Practices Interesting Technologies • GraphQL(https://facebook.github.io/ react/blog/2015/05/01/graphql- introduction.html ) • Odata(http://odata.org ) • Falcor (http://netflix.github.io/falcor/ )
  • 25. • Enable agile ETL models • Support federated job processing Batch Data Access Best Practices Interesting Technologies • Genie(https://github.com/Netfl ix/genie ) • Luigi(https://github.com/spoti fy/luigi ) • Apache Pig(https://pig.apache.org/ )
  • 26. • Enable streaming data from line of business systems • Provide the infrastructure to incorporate new data sources such as sensors, web streams etc • Provide a consistent model for data integration between line of business systems Stream Data Access Best Practices Interesting Technologies • Apache Kafka(http://kafka.apache.org/ ) • RabbitMQ(https://www.rabbit mq.com/ ) • ZeroMQ(http://zeromq.org/ ) • Many others….
  • 27. • Enable federated aggregation of disparate data sources • Focus on small data sources • Enable standard protocols to access the federated data sources Data Virtualization Best Practices Interesting Technologies • Denodo(http://www.denodo.co m/en ) • JBoss Data Virtualization(http://www.jbos s.org/products/datavirt/overvie w/ )
  • 29. • Store heterogeneous business data at scale • Provide consistent models to aggregate and compose data sources from different data sources • Manage and curate business data sources • Discover and consume data available in your organization Goals
  • 30. • Data lakes • Data quality • Data discovery • Data transformation Foundational Capabilities
  • 31. • Focus on complementing and expanding our data warehouse capabilities • Optimize the data lake to incorporate heterogeneous data sources • Support multiple data ingestion models • Consider a hybrid cloud strategy (pilot vs. production ) Data Lakes Best Practices Interesting Technologies • Hadoop(http://hadoop.apache.org/ ) • Hive(https://hive.apache.org/ ) • Hbase(https://hbase.apache.org/ ) • Spark(http://spark.apache.org/ ) • Greenplum(http://greenplum.org/ ) • Many others….
  • 32. • Avoid traditional data quality methodologies • Leverage machine learning to streamline data quality rules • Leverage modern data quality platforms • Crowsourced vs. centralized data quality models Data Quality Best Practices Interesting Technologies • Trifacta(http://trifacta.com ) • Tamr(http://tamr.com ) • Alation(https://alation.com/ ) • Paxata(http://www.paxata.com/ )
  • 33. • Master management solutions don’t work with modern data sources • Promote crow-sourced vs. centralized data publishing • Focus on user experience • Consider build vs. buy options Data Discovery Best Practices Interesting Technologies • Tamr(http://tamr.com ) • Custom solutions… • Spotify Raynor • Netflix big data portal
  • 34. • Enable programmable ETLs • Support data transformations for both batch and real time data sources • Agility over robustness Data Transformations Best Practices Interesting Technologies • Apache Pig(https://pig.apache.org/ ) • Streamsets(https://streamsets. com/ ) • Apache Spark (http://spark.apache.org/ )
  • 36. • Discover insights of business data sources • Integrate machine learning capabilities as part of the enterprise data pipeline • Provide the foundation for predictive analytic capabilities across the enterprise • Enable programmatic execution of machine learning models Goals
  • 37. • Data visualization & self-service BI • Predictive analytics • Stream analytics • Proactive analytics Foundational Capabilities
  • 38. • Access business data sources from mainstream data visualization tools like Excel , Tableau, QlickView, Datameer, etc. • Publish data visualizations so that they can be discovered by other information workers • Embed visualization as part of existing line of business solutions Data Visualization and Self-Service BI Best Practices Interesting Technologies • Tableau(http://www.tableau.com/ ) • PowerBI(https://powerbi.microsoft.co m/en-us/ ) • Datameer(http://www.datameer.com/ ) • QlikView(http://www.qlik.com/ ) • Visualization libraries • ….
  • 39. • Implement the tools and frameworks to author machine learning models using business data sources • Expose predictive models via programmable APIs • Provide the infrastructure to test, train and evaluate machine learning models Predictive Analytics Best Practices Interesting Technologies • Spark Mlib(http://spark.apache.org/docs/la test/mllib-guide.html ) • Scikit-Learn(http://scikit-learn.org/ ) • Dato(https://dato.com/ ) • H20.ai(http://www.h2o.ai/ ) • ….
  • 40. • Aggregate data real time from diverse data sources • Model static queries over dynamic streams of data • Create simulations and replays of real data streams Stream Analytics Best Practices Interesting Technologies • Apache Storm(http://storm.apache.org/ ) • Spark Streaming (http://spark.apache.org/streaming/ ) • Apache Samza(http://samza.apache.org/ ) • ….
  • 41. • Automate actions based on the output of predictive models • Use programmatic models to script proactive analytics business rules • Continuously test and validate proactive rules Proactive Analytics Best Practices Interesting Technologies • Spark Mlib(http://spark.apache.org/d ocs/latest/mllib-guide.html ) • Scikit-Learn(http://scikit- learn.org/ )
  • 43. • Leverage a consistent data pipeline as part of all solutions • Empower different teams to contribute to different aspects of the big data pipeline • Keep track of key metrics about the big data pipeline such as time to deliver solutions, data volume over time, data quality metrics, etc Enterprise Data Solutions
  • 44. • Data discovery • Data quality • Data testing tools • … Some Examples
  • 45. • Mobile analytics • Embedded analytics capabilities (ex: Salesforce Wave, Workday) • Aggregation with external sources • Video & image analytics • Deep learning • …. Other Interesting Capabilities
  • 46. Building a big data and advanced analytics pipeline
  • 47. Infrastructure-Driven Data Storage Data Aggregation Data Transformation Data Discovery Others… Stream Analytics Predictive Analytics Proactive Analytics Others… Real time data access Batch data access Stream data access Solution Solution Solution Data Access Data Infrastructure Data Science Solutions
  • 48. Domain-Driven Data Storage Data Aggregation Data Transformation Data Discovery Others… Stream Analytics Predictive Analytics Proactive Analytics Others… Real time data access Batch data access Stream data access Solution Solution Solution Data Access Data Infrastructure Data Science Solutions
  • 49. • Lead by the architecture team • Military discipline • Commitment from business stakeholders Infrastructure-Drives vs. Domain-Driven Approaches Infrastructure-Driven Domain-Driven • Federated data teams • Rapid releases • Pervasive communications
  • 50. • Establish a vision across all levels of the data pipeline • You can’t buy everything…Is likely you will build custom data infrastructure building blocks • Deliver infrastructure and functional capabilities incrementally • Establish a data innovation group responsible for piloting infrastructure capabilities ahead of production schedules • Encourage adoption even in early stages • Iterate Some General Rules
  • 51. Summary • Big data and advanced analytics pipelines are based on 4 fundamental elements: data access, data infrastructure, data science, data solutions…. • A lot of inspiration can be learned from the big data solutions built by lead internet vendors • Establish a common vision and mission • Start small….iterate….