Everybody wants to jump on the “Big Data” hype cycle, to “do scale”, and to use the coolest tools on the market, like Hadoop, Apache Spark and Apache Cassandra.
But do they ask themselves whether there is really a reason for that?
In the talk we’ll give a brief overview of the technologies in today’s Big Data world and discuss the problems that really emerge when you want to enter the great world of Big Data handling.
A tour of the Hadoop ecosystem, Apache Spark and the other distributed tools leading the market today will give you a notion of the real costs of entering that world.
I promise to share some stories from the trenches :)
(And about the “pool” thing...I don’t really know how to swim)
An Introduction to Rearview - Time Series Based Monitoring (VictorOps)
Jeff Simpson, senior software engineer at VictorOps, delivered this presentation at the Frontrange Alerting & Monitoring meetup...along with an awesome live demo.
Monitoring Big Data Systems - "The Simple Way" (Demi Ben-Ari)
Once you start working with distributed Big Data systems, you start discovering a whole bunch of problems you won’t find in monolithic systems.
All of a sudden, monitoring all of the components becomes a big data problem in itself.
In the talk we’ll cover the aspects you should take into consideration when monitoring a distributed system built with tools like:
web services, Apache Spark, Cassandra, MongoDB and Amazon Web Services.
Beyond the tools, what should you monitor about the actual data that flows through the system?
We’ll cover the simplest solution, built with your day-to-day open source tools, and the surprising thing: it comes not from an Ops guy.
Demi Ben-Ari is a co-founder and CTO at Panorays.
Demi has over 9 years of experience building systems ranging from near-real-time applications to distributed Big Data systems.
He describes himself as a software development groupie, interested in tackling cutting-edge technologies.
Demi is also a co-founder of the “Big Things” Big Data community: http://somebigthings.com/big-things-intro/
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017 (Demi Ben-Ari)
Once you start working with distributed Big Data systems, you start discovering a whole bunch of problems you won’t find in monolithic systems.
All of a sudden, monitoring all of the components becomes a big data problem in itself.
In the talk we’ll cover the aspects you should take into consideration when monitoring a distributed system built with tools like:
web services, Apache Spark, Cassandra, MongoDB and Amazon Web Services.
Beyond the tools, what should you monitor about the actual data that flows through the system?
We’ll cover the simplest solution, built with your day-to-day open source tools, and the surprising thing: it comes not from an Ops guy.
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 (Demi Ben-Ari)
Once you start working with distributed Big Data systems, you start discovering a whole bunch of problems you won’t find in monolithic systems.
All of a sudden, monitoring all of the components becomes a big data problem in itself.
In the talk we’ll cover the aspects you should take into consideration when monitoring a distributed system built with tools like:
web services, Apache Spark, Cassandra, MongoDB and Amazon Web Services.
Beyond the tools, what should you monitor about the actual data that flows through the system?
We’ll cover the simplest solution, built with your day-to-day open source tools, and the surprising thing: it comes not from an Ops guy.
Evolution of the Prometheus TSDB (Percona Live Europe 2017) - Brian Brazil
Prometheus is a monitoring system with a custom time series database at its core. Prometheus 2.0 features the 3rd major iteration of this database. This talk will look at how it has evolved, and how it fits into the goal of doing metrics-based monitoring.
What does "monitoring" mean? (FOSDEM 2017) - Brian Brazil
Monitoring can mean very different things to different people, and this often leads to confusion and misunderstandings. There are many offerings, both free software and commercial, and it's not always clear where each fits in the bigger picture. This talk will look a bit at the history of monitoring, and then at the general categories of metrics, logs, profiling and distributed tracing, and how each of these is important in cloud-based environments.
Video: https://www.youtube.com/watch?v=hCBGyLRJ1qo
Building a system for machine and event-oriented data - Velocity, Santa Clara (Eric Sammer)
This talk was presented at O'Reilly's Velocity conference in Santa Clara, May 28 2015.
Abstract: http://velocityconf.com/devops-web-performance-2015/public/schedule/detail/42284
This talk is about Taskerman, a distributed cluster task manager built on top of AWS SQS, Zookeeper and Yelp PaaSTA. The talk was given at Imperial College, London as part of its 'Application of Computing in Industry' series: http://www.imperial.ac.uk/computing/industry/aci/yelp/
Building highly reliable data pipelines @ Datadog, by Quentin François (Paris Data Engineers!)
Some features at the core of Datadog's product rely on data pipelines built with Spark that process trillions of points every day. In this presentation, we'll look at the main principles we apply at Datadog to keep our pipelines reliable despite the exponential growth of data volume, hardware failures, corrupted data and human error.
Paris Data Eng' Meetup, 26 February 2019 @ Datadog
Anatomy of a Prometheus Client Library (PromCon 2018) - Brian Brazil
Prometheus client libraries are notably different from most other options in the space. In order to get the best insights into your applications it helps to know how they are designed, and why they are designed that way. This talk will look at how client libraries are structured, how that makes them easy to use, some tips for instrumentation, and why you should use them even if you aren't using Prometheus.
Evolution of Monitoring and Prometheus (Dublin 2018) - Brian Brazil
This talk looks at the evolution of monitoring over time, the ways in which you can approach monitoring, where Prometheus fits into all this, and how Prometheus itself has grown over time.
Everything You Wanted to Know About Distributed Tracing (Amuhinda Hungai)
In the age of microservices, understanding how applications are executing in a highly distributed environment can be complicated. Looking at log files only gives a snapshot of the whole story and looking at a single service in isolation simply does not give enough information. Each service is just one side of a bigger story. Distributed tracing has emerged as an invaluable technique that succeeds in summarizing all sides of the story into a shared timeline. Yet deploying it can be quite challenging, especially in the large scale, polyglot environments of modern companies that mix together many different technologies. During this session, we will take a look at patterns and means to implement Tracing for services. After introducing the basic concepts we will cover how the tracing model works, and how to safely use it in production to troubleshoot and diagnose issues.
Big Data Analysis: Deciphering the Haystack (Srinath Perera)
A primary outcome of Big Data is to derive useful and actionable insights from large or challenging data collections. The goal is to run the transformations from data, to information, to knowledge, and finally to insights. This ranges from calculating simple analytics like mean, max, and median, to deriving an overall understanding of the data by building models, and finally to deriving predictions from the data. In some cases we can afford to wait while we collect and process the data, while in other cases we need to know the outputs right away. MapReduce has been the de facto standard for data processing, and we will start our discussion from there. However, that is only one side of the problem. There are other technologies, like Apache Spark and Apache Drill, gaining ground, and also real-time processing technologies like Stream Processing and Complex Event Processing. Finally, there is a lot of work on porting decision technologies like machine learning into the big data landscape. This talk discusses big data processing in general and looks at each of these technologies, comparing and contrasting them.
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas (MongoDB)
Moving to a new home is daunting. Packing up all your things, getting a vehicle to move it all, unpacking it, updating your mailing address, and making sure you did not leave anything behind. Well, the move to MongoDB Atlas is similar, but all the logistics are already figured out for you by MongoDB.
Datadog is monitoring that does not suck. It's metrics-friendly, people-friendly and developer-friendly monitoring.
Learn more at https://www.datadoghq.com/
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2lGNybu.
Stefan Krawczyk discusses how his team at StitchFix uses the cloud to enable over 80 data scientists to be productive. He also talks about prototyping ideas, algorithms and analyses; how they set up and keep schemas in sync between Hive, Presto, Redshift and Spark; and how they make access easy for their data scientists. Filmed at qconsf.com.
Stefan Krawczyk is Algo Dev Platform Lead at StitchFix, where he’s leading development of the algorithm development platform. He spent formative years at Stanford, LinkedIn, Nextdoor & Idibon, working on everything from growth engineering, product engineering, data engineering, to recommendation systems, NLP, data science and business intelligence.
The state of analytics has changed dramatically over the last few years. Hadoop is now commonplace, and the ecosystem has evolved to include new tools such as Spark, Shark, and Drill, that live alongside the old MapReduce-based standards. It can be difficult to keep up with the pace of change, and newcomers are left with a dizzying variety of seemingly similar choices. This is compounded by the number of possible deployment permutations, which can cause all but the most determined to simply stick with the tried and true. In this talk I will introduce you to a powerhouse combination of Cassandra and Spark, which provides a high-speed platform for both real-time and batch analysis.
I'll provide guidelines for thinking about empirical performance evaluation of parallel programs in general and of Spark jobs in particular. It's easier to be systematic about this if you think in terms of "what's the effective network bandwidth we're getting?" instead of "How fast does this particular job run?" In addition, the figure of merit for parallel performance isn't necessarily obvious. If you want to minimize your AWS bill you should almost certainly run on a single node (but your job may take six months to finish). You may think you want answers as quickly as possible, but if you could make a job finish in 55 minutes instead 60 minutes while doubling your AWS bill, would you do it? No? Then what exactly is the metric that you should optimize?
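The 55-versus-60-minute question above can be made concrete with a toy figure of merit. The function, weights and dollar amounts below are purely hypothetical, a sketch of how one might blend runtime against cost rather than any standard metric:

```python
def job_score(runtime_min, cost_usd, alpha=0.5):
    """Blend runtime and cost into one score; lower is better.

    alpha=1.0 optimizes purely for speed, alpha=0.0 purely for cost.
    The units and weighting here are illustrative, not normalized.
    """
    return alpha * runtime_min + (1 - alpha) * cost_usd

# The talk's example: finish in 55 min instead of 60 while doubling the bill.
baseline = job_score(60, 100)   # one node at a hypothetical $100
scaled = job_score(55, 200)     # twice the nodes, slightly faster
# With an even weighting, the "faster" configuration actually scores worse,
# which is exactly why you must decide what metric you are optimizing.
```

Only if alpha is pushed close to 1.0 (pure speed) does the scaled-out run win, which is the point of the abstract: the metric is a choice, not a given.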
The monolith to cloud-native, microservices evolution has driven a shift from monitoring to observability. OpenTelemetry, a merger of the OpenTracing and OpenCensus projects, is enabling Observability 2.0. This talk gives an overview of the OpenTelemetry project and then outlines some production-proven architectures for improving the observability of your applications and systems.
Talk I did on log aggregation with the ELK stack at Leeds DevOps. Covers how we process over 800,000 logs per hour at laterooms, and the cultural changes this has helped drive.
Cassandra Summit 2014: Diagnosing Problems in Production (DataStax Academy)
This session covers diagnosing and solving common problems encountered in production, using performance profiling tools. We’ll also give a crash course in basic JVM garbage collection tuning. Attendees will leave with a better understanding of what to look for when they encounter problems with their in-production Cassandra cluster.
Considerations for using NoSQL technology on your next IT project (Akmal Chaudhri)
The slideshare view is not great, but the downloadable PDF file is just fine.
Originally presented at:
British Computer Society (BCS) SPA-270, London, UK, 6 February 2013
http://www.bcs-spa.org/cgi-bin/view/SPA/NoSqlDatabasesForBigData
DBMS market trends, focused on graph DBMSs: the benefits of graph databases, their forecasted growth rate, and advice from renowned market research institutes.
For some, Hadoop is synonymous with “Big Data,” but Hadoop is just one component of a successful Big Data architecture. Depending on one’s application, it may not even be the most important part.
NoSQL solutions like MongoDB also play a dominant role in storage and real-time data processing, helping companies keep pace with the scale of their data requirements. But NoSQL figures even more prominently in helping enterprises consume a wide variety of data sources at speeds not currently possible in Hadoop. NoSQL, then, offers a useful complement to Hadoop, as well as to the transaction-based data of traditional RDBMSs.
Tackling Big Data is not a one-tool job, and so the orchestration of the appropriate NoSQL database with Hadoop and RDBMS is essential. In this session, we’ll dig deep into the different types of NoSQL, identifying how they differ and the types of Big Data workloads for which they’re best suited. We’ll also explore the trade-offs one makes in choosing NoSQL databases like MongoDB or Neo4j over an RDBMS like MySQL, and when it makes sense to use both Hadoop and NoSQL and when it’s more appropriate to use NoSQL on its own.
Best Painters in the World - Ortega Maila - Works and Biography (Arte Mundial)
Cristóbal Ortega Maila is one of the most famous painters in the world; in his latest collection, "Ancestral Art", he shows us all the energy of his ancestors.
Oncology Big Data: A Mirage or Oasis of Clinical Value? (Michael Peters)
The title of the presentation, Oncology Big Data: A Mirage or Oasis of Clinical Value, reflects what I believe the field of oncology is increasingly challenged with, from both a clinical and a business-side perspective.
The world has changed, and having one huge server won’t do the job anymore. When you’re talking about vast amounts of data that grow all the time, the ability to scale out will be your savior. Apache Spark is a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing.
This lecture will cover the basics of Apache Spark and distributed computing, and the development tools needed to have a functional environment.
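As a taste of the programming model such an introduction usually starts with, here is the classic word count, sketched in plain single-machine Python. In Spark the same flatMap / map / reduceByKey pipeline runs distributed across a cluster; the function below is just an illustrative stand-in for that logic:

```python
from collections import Counter

def word_count(lines):
    # flatMap: split each line into words; map: (word, 1); reduceByKey: sum.
    # Counter collapses the map and reduce steps on a single machine;
    # Spark performs the same aggregation in parallel across many nodes.
    return Counter(word for line in lines for word in line.split())

counts = word_count(["spark scales out", "spark streams"])
# counts["spark"] == 2, counts["scales"] == 1
```

The appeal of Spark is that scaling this logic out requires changing the execution engine, not the shape of the computation.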
Thorne & Derrick UK are official distributors of the SebaKMT range of LV-HV Cable Fault Location instruments including cable sheath testers and cable sheath fault prelocation and pinpointing units. SebaKMT are the leading manufacturer of measurement equipment for the diagnosis of cable networks and for cable fault location. Cable faults are the ''natural enemy'' of reliability. The innovative SebaKMT products make it possible to quickly localise low and high voltage cable faults without causing damage to fault-free parts of the cable.
Bibliographic session at the Instituto Nacional de Psiquiatría "Dr. Ramón de la Fuente Muñiz" on the topic: factors associated with patient retention in the outpatient service of the clinical services directorate.
Study of the influence of parasitic elements on the SWR value of an antenna (fralbe.com)
This paper presents the basic parameters for the construction of a 7-element Yagi-Uda antenna. First, design considerations are addressed. Then simulations are carried out, and finally field tests are performed to investigate how the SWR value is influenced by the presence of parasitic elements.
Sniffing HTTPS Using YAMAS | Lucideus Tech (Rahul Tyagi)
Yamas is a tool that aims at facilitating MITM attacks by automating the whole process, from setting up IP forwarding and modifying iptables to ARP cache poisoning (using either ettercap or arpspoof). The traffic is stripped of SSL with the famous sslstrip 0.9. While other MITM scripts do that too, Yamas has a unique and appreciated feature: it parses the logs as the attack keeps running, so that credentials are displayed just as they are sniffed. The parsing method is a home-made, 100% pure bash script that so far has never missed anything. And if it did, just report it to me and I'll update the file used to parse the logs. This update is independent from the whole update process, making it a very flexible parser.
Thinking DevOps in the Era of the Cloud (Demi Ben-Ari)
The lines between Development and Operations people have gotten blurry, and many skills now need to be held by both sides.
In the talk we'll cover all of the considerations to take into account when creating development and production environments, mentioning Continuous Integration, Continuous Deployment and the buzzword "DevOps", and also some real implementations in the industry.
And of course, how could we leave out the real enabler of the whole deal,
"The Cloud", which gives us a tool set that makes life much easier when implementing all of these practices.
Demi Ben-Ari - Monitoring Big Data Systems Done "The Simple Way" (Codemotion)
Once you start working with Big Data systems, you discover a whole bunch of problems you won’t find in monolithic systems. Monitoring all of the components becomes a big data problem in itself. In the talk we’ll cover the aspects you should take into consideration when monitoring a distributed system built with tools like web services, Spark, Cassandra, MongoDB and AWS. Beyond the tools, what should you monitor about the actual data that flows through the system? We’ll cover the simplest solution, built with your day-to-day open source tools, and the surprising thing: it comes not from an Ops guy.
Big data primarily refers to data sets that are too large or complex to be dealt with by traditional data-processing application software. Data with many entries (rows) offer greater statistical power, while data with higher complexity (more attributes or columns) may lead to a higher false discovery rate.[2] Though the term is sometimes used loosely, partly due to a lack of a formal definition, the best interpretation is that it is a large body of information that cannot be comprehended when used in small amounts only.
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...Codemotion
Once you start working with Big Data systems, you discover a whole bunch of problems you won’t find in monolithic systems. Monitoring all of the components becomes a big data problem in itself. In the talk, we’ll mention all of the aspects you should take into consideration when monitoring a distributed system built on tools like Web Services, Spark, Cassandra, MongoDB, and AWS. Beyond the tools, what should you monitor about the actual data that flows through the system? We’ll cover the simplest solution using your day-to-day open source tools, and the surprising thing is that it comes not from an Ops guy.
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...Demi Ben-Ari
Once you start working with distributed Big Data systems, you start discovering a whole bunch of problems you won’t find in monolithic systems.
All of a sudden, monitoring all of the components becomes a big data problem in itself.
In the talk we’ll mention all of the aspects you should take into consideration when monitoring a distributed system built on tools like:
Web Services, Apache Spark, Cassandra, MongoDB, Amazon Web Services.
Beyond the tools, what should you monitor about the actual data that flows through the system?
We’ll cover the simplest solution using your day-to-day open source tools, and the surprising thing is that it comes not from an Ops guy.
A brief intro to the idea of what Big Data is and its potential. This is primarily a basic study, and I have quoted the sources of the infographics, stats, and text at the end. If I have missed any reference due to human error and you recognize another source, please mention it.
Big Data made easy in the era of the Cloud - Demi Ben-AriDemi Ben-Ari
Talking about the ease of use and handling Big Data technologies in the Cloud. Using Google Cloud Platform and Amazon Web Services and all of the tools around it.
Showing the problems and how we can solve them with simple tools.
Talk on Data Discovery and Metadata by Mark Grover from July 2019.
Goes into detail of the problem, build/buy/adopt analysis and Lyft's solution - Amundsen, along with thoughts on the future.
In 2018 we've seen a huge uptick in applications using Kubernetes for their deployment method. Many times your persistent data layer is a difficult decision. What will you store data in, how long will you need to access this data and who will manage the lifecycle of this data? These are important questions many developers and ops teams have taken to heart. In this talk we'll review how the data layer is managed for high availability and reliability in modern application deployment. The attendee should leave having a better understanding of the options in front of them and their ability to build applications in any hosting environment.
Choosing technologies for a big data solution in the cloudJames Serra
Has your company been building data warehouses for years using SQL Server? And are you now tasked with creating or moving your data warehouse to the cloud and modernizing it to support “Big Data”? What technologies and tools should you use? That is what this presentation will help you answer. First we will cover what questions to ask concerning data (type, size, frequency), reporting, performance needs, on-prem vs cloud, staff technology skills, OSS requirements, cost, and MDM needs. Then we will show you common big data architecture solutions and help you to answer questions such as: Where do I store the data? Should I use a data lake? Do I still need a cube? What about Hadoop/NoSQL? Do I need the power of MPP? Should I build a "logical data warehouse"? What is this lambda architecture? Can I use Hadoop for my DW? Finally, we’ll show some architectures of real-world customer big data solutions. Come to this session to get started down the path to making the proper technology choices in moving to the cloud.
This presentation was inspired post read of "TimeSeries Databases" -- Ted Dunning & Ellen Friedman.
I have tried to summarize a lot of the previous bench marks. Hope others find it useful. The slides were compiled early 2015 so some of the results might have changed but the core literature should still hold.
Similar to Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays (20)
Thinking DevOps in the Era of the Cloud - Demi Ben-AriDemi Ben-Ari
The lines between Development and Operations people have gotten blurry, and lots of skills need to be held by both sides. In the talk we'll cover all of the considerations to be taken into account when creating development and production environments, mentioning Continuous Integration, Continuous Deployment, and the buzzword "DevOps", and we'll also discuss some real implementations in the industry. Of course, how can we leave out the real enabler of the whole deal, "The Cloud", which gives us a tool set that makes life much easier when implementing all of these practices.
Kubernetes, Toolbox to fail or succeed for beginners - Demi Ben-Ari, VP R&D @...Demi Ben-Ari
A talk that covers the history and the reasons to start using Kubernetes and implementing a microservices architecture, discussing Docker, basic Kubernetes terms, and some of the pitfalls you can run into while implementing it.
Also mentioning the use case of Panorays.
All I Wanted Is to Found a Startup - Demi Ben-Ari - PanoraysDemi Ben-Ari
Once you have the “great groundbreaking idea of your life”, what are all of the mistakes that can make it fail in the world of entrepreneurship? We’ll talk about the idea, partners, fundraising, and the company culture you’d like to create, the first steps to building the best company possible, and we'll have some fun along the way.
Promise to share from experience.
Hacking for fun & profit - The Kubernetes Way - Demi Ben-Ari - PanoraysDemi Ben-Ari
To defend against attacks, think like a hacker. But does that mean you need to be a DevOps expert? Security researchers today need to discover new attack techniques. However, much of their focus is diverted to backend coding. We share how to build an infrastructure for researchers that allows them to concentrate on business logic and writing hacker “tasks”. Using Docker and Kubernetes on Google Cloud, these tasks can then be performed in parallel and without a lot of DevOps hassle. Our technique removes two common barriers: first, long and risky deployment processes, and second, low transparency within the production system.
Promise to share the stupid things too.
Community, Unifying the Geeks to Create Value - Demi Ben-AriDemi Ben-Ari
After running developer communities over the past 3 years, meeting a lot of great people, and learning a lot about the Israeli Hi-Tech ecosystem,
I’ll share all that I’ve learned about what it actually means to create what is called a “Community”:
the value that you as a community lead can give to the people in it, and the things that you can gain as a geek out of it.
You’ll be surprised to learn how easy and hard it can be at the same time.
I’ll tell you about the steps that I’ve taken in this journey and what, in my opinion, might kill the concept of a “Developer Community”.
By the end of the talk you'll have as many tools as possible to be able to create a community of your own.
Demi Ben Ari - Apache Spark 101 - First Steps into distributed computing:
The world has changed, having one huge server won’t do the job, the ability to Scale Out would be your savior. Apache Spark is a fast and general engine for big data processing, with streaming, SQL, machine learning and graph processing. Showing the basics of Apache Spark and distributed computing.
Demi is a Software engineer, Entrepreneur and an International Tech Speaker.
Demi has over 10 years of experience in building various systems both from the field of near real time applications and Big Data distributed systems.
Co-Founder of the “Big Things” Big Data community and Google Developer Group Cloud.
Big Data Expert, but interested in all kinds of technologies, from front-end to backend, whatever moves data around.
Know the Startup World - Demi Ben-Ari - Ofek AlumniDemi Ben-Ari
Insights and explanation about the Hi-Tech industry and about the terms of Startup companies.
Brought by the Ofek Alumni association, supporting its alumni.
Scala like distributed collections - dumping time-series data with apache sparkDemi Ben-Ari
Spark RDDs are almost identical to Scala collections, just in a distributed manner; all of the transformations and actions are derived from the Scala collections API.
As Martin Odersky mentioned, “Spark - The Ultimate Scala Collections” is the right way to look at RDDs. But with that great distributed power comes a great many data problems: at first you’ll start tackling the concept of partitioning, then the actual data becomes the next thing to worry about.
In the talk we’ll go through an overview on Spark's architecture, and see how similar RDDs are to the Scala collections API. We'll then shift to the world of problems that you’ll be facing when using Spark for processing a vast volume of time-series data with multiple data stores (S3, MongoDB, Apache Cassandra, MySQL).
When you start tackling many scale and performance problems, many questions arise:
> How to handle missing data?
> Should the system handle both serving and backend processes, or should we separate them out?
> Which solution is cheaper?
> How do we get the best performance for money spent?
In the talk we will tell the tale of all of the transformations we’ve made to our data and review the multiple data persistency layers... and I’ll try my best NOT to answer the question “which persistency layer is the best?” but I do promise to share our pains and lessons learned!
S3 cassandra or outer space? dumping time series data using sparkDemi Ben-Ari
Vast volume of our processed data is Time Series data and once you start working with distributed systems, you start tackling many scale and performance problems, many questions arise:
How to handle missing data?
Should my system handle both serving and backend processes, or should they be separated out?
Which one of the solutions will be cheaper? Which gives the best performance for the money?
In the talk we will tell the tale of all of the transformations we’ve made to our data model @Windward, show some of the problems we’ve handled, review the multiple data persistency layers like: S3, MongoDB, Apache Cassandra, MySQL.
And I’ll try my best NOT to answer the question “Which one of them is the Best?”
Sharing our Pain and Lessons learned is promised!
Bio:
Demi Ben-Ari, Sr. Data Engineer @Windward,
I have over 9 years of experience in building various systems both from the field of near real time applications and Big Data distributed systems.
Co-Founder of the “Big Things” Big Data community: http://somebigthings.com/big-things-intro/
I’m a software development groupie, Interested in tackling cutting edge technologies.
Spark 101 – First Steps To Distributed Computing - Demi Ben-Ari @ Ofek AlumniDemi Ben-Ari
The world has changed, and having one huge server won’t do the job anymore. When you’re talking about vast amounts of data that keep growing, the ability to scale out will be your saviour.
This lecture will be about the basics of Apache Spark and distributed computing and the development tools needed to have a functional environment.
Bio:
Demi Ben-Ari, Sr. Data Engineer @Windward, Ofek Alumni
Has over 9 years of experience in building various systems both from the field of near real time applications and Big Data distributed systems.
Co-Founder of the “Big Things” Big Data community: http://somebigthings.com/big-things-i...
Migrating Data Pipeline from MongoDB to CassandraDemi Ben-Ari
MongoDB is a great NoSQL database; it’s very flexible and easy to use.
But would it handle massive read/write throughput?
And what happens when you need to scale everything out, easily?
We will lay out the reasons and the steps of migrating our data pipeline to Apache Cassandra in a short period without having any prior knowledge.
We’ll list our lessons learned as well.
Bio:
Demi Ben-Ari, Sr. Data Engineer @Windward,
I have over 9 years of experience in building various systems both from the field of near real time applications and Big Data distributed systems.
Co-Organizer of the “Big Things” Big Data community: http://somebigthings.com/big-things-intro/
Transform & Analyze Time Series Data via Apache Spark @WindwardDemi Ben-Ari
High overview lecture about analyzing time series data with Apache Spark micro service application @Windward Ltd.
The ways to handle the missing data problem and the tool that we've used to handle the problem.
We will be showing the use case of the implementation of a Data Pipeline in the maritime domain @Windward via Spark applications.
The process was converting a Monolith application to a fully distributed and scalable application.
We'll be talking about all the tools and the process of taking an idea and developing Spark applications around it. We will show the development of an application end to end, from DevOps to the method of thinking about application development, presenting use cases and the lessons learned at Windward Ltd. I hope the talk will give you some more practical tools for "Spark"ing your way around.
Use case of the usage of Apache Spark @Windward Ltd.
Video lecture on YouTube: https://www.youtube.com/watch?v=rPO6P5YIKUI
Showing the domain of the company,
A short introduction of Apache Spark,
And the Tool Box used @Windward Ltd to form a working production Spark Data Pipeline.
Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
1. Quick dive into the
Big Data pool
without drowning
Demi Ben-Ari - VP R&D @ Panorays
2. About Me
Demi Ben-Ari, Co-Founder & VP R&D @ Panorays
● B.Sc. Computer Science – Academic College of Tel Aviv-Yaffo
● Co-Founder “Big Things” Big Data Community
In the Past:
● Sr. Data Engineer - Windward
● Team Leader & Sr. Java Software Engineer,
Missile defense and Alert System - “Ofek” – IAF
Interested in almost every kind of technology – A True Geek
3. Agenda
● Basic Concepts
● Introduction to Big Data frameworks
● Distributed Systems => Problems
● Monitoring
● Conclusions
6. What is Big Data (IMHO)?
● Systems involving the “3 Vs”:
What are the right questions we want to ask?
○ Volume - How much?
○ Velocity - How fast?
○ Variety - What kind? (Difference)
7. What is Big Data (IMHO)
● Some define it the “7 Vs”
○ Variability (constantly changing)
○ Veracity (accuracy)
○ Visualization
○ Value
8. What is Big Data (IMHO)
● Characteristics
○ Multi-region availability
○ Very fast and reliable response
○ No single point of failure
9. Why Not Relational Data
● Relational Model Provides
○ Normalized table schema
○ Cross table joins
○ ACID compliance (Atomicity, Consistency, Isolation, Durability)
● But at a very high cost
○ Big Data table joins - billions of rows - massive overhead
○ Sharding tables across systems is complex and fragile
● Modern applications have different priorities
○ The need for speed and availability comes before consistency
○ Racks of commodity servers trump massive high-end systems
○ The real-world need for transactional guarantees is limited
10. What strategies help manage Big Data?
● Distribute data across nodes
○ Replication
● Relax consistency requirements
● Relax schema requirements
● Optimize data to suit actual needs
11. What is the NoSQL landscape?
● 4 broad classes of non-relational databases (DB-Engines)
○ Graph: data elements each relate to N others in graph / network
○ Key-Value: keys map to arbitrary values of any data type
○ Document: document sets (JSON) queryable in whole or part
○ Wide column store (Column Family): keys mapped to sets of N typed columns
● Three key factors to help understand the subject
○ Consistency: Get identical results, regardless of which node is queried?
○ Availability: Respond to very high read and write volumes?
○ Partition tolerance: Still available when part of it is down?
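To make the four data-model classes above concrete, here is an illustrative sketch (not from the slides) of the same "user" record shaped for each model, using plain Python structures to stand in for the stores; all names and values are made up.

```python
import json

# Key-Value: an opaque value behind a single key; the store can't query inside it.
kv_store = {"user:42": json.dumps({"name": "Dana", "city": "Tel Aviv"})}

# Document: the value is a structured document, queryable in whole or in part.
doc_store = {"users": [{"_id": 42, "name": "Dana", "city": "Tel Aviv",
                        "tags": ["big-data", "spark"]}]}

# Wide column (column family): a row key mapped to named, typed columns.
wide_store = {"42": {"profile:name": "Dana", "profile:city": "Tel Aviv",
                     "activity:last_login": "2016-01-05"}}

# Graph: nodes plus typed edges relating them to N other nodes.
graph_store = {"nodes": {42: {"name": "Dana"}, 7: {"name": "Avi"}},
               "edges": [(42, "KNOWS", 7)]}
```

The point is the shape, not the storage engine: which model fits depends on how you need to query, which is exactly why the consistency/availability/partition-tolerance questions follow.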
12. What is the CAP theorem?
● In distributed systems, consistency, availability, and partition tolerance exist in
a mutually dependent relationship; you can pick any two.
○ CA (Consistency + Availability): MySQL, PostgreSQL, Greenplum, Vertica, Neo4J
○ AP (Availability + Partition tolerance): Cassandra, DynamoDB, Riak, CouchDB, Voldemort
○ CP (Consistency + Partition tolerance): HBase, MongoDB, Redis, BigTable, BerkeleyDB
(data models represented across the diagram: RDBMS, Graph, Key-Value, Wide Column)
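As a toy illustration of the trade-off, here is a hypothetical Dynamo-style quorum store (an assumption of mine, not slide material): with N replicas, W write acknowledgements, and R read probes, choosing R + W > N forces every read quorum to overlap the latest write quorum, buying consistency at the cost of requiring more replicas to be reachable.

```python
class QuorumStore:
    """In-memory sketch of quorum reads/writes over N replicas (hypothetical)."""

    def __init__(self, n=3, w=2, r=2):
        assert r + w > n, "quorums must overlap for read-your-writes"
        self.replicas = [{} for _ in range(n)]
        self.w, self.r = w, r
        self.clock = 0  # logical timestamp used to order versions

    def put(self, key, value):
        self.clock += 1
        for replica in self.replicas[:self.w]:  # ack from W replicas only
            replica[key] = (self.clock, value)

    def get(self, key):
        # Probe R replicas; the newest timestamp wins, resolving stale copies.
        versions = [rep[key] for rep in self.replicas[-self.r:] if key in rep]
        return max(versions)[1] if versions else None
```

With n=3, w=2, r=2 the write set {0, 1} and read set {1, 2} always share replica 1, so `get` sees the latest `put` even though no single replica holds everything.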
13. DB Engines - Comparison
● http://db-engines.com/en/ranking
18. Characteristics of Hadoop
● A system to process very large amounts of unstructured and complex
data at the desired speed
● A system that runs on a large number of machines that don’t share any
memory or disk
● A system that runs on a cluster of machines that can be put together at
relatively low cost and maintained easily
19. Hadoop Principles
● “A system that moves the computation to where the data is”
● Key concepts of Hadoop:
○ Flexibility
○ Scalability
○ Low cost
○ Fault tolerance
20. Hadoop Core Components
● HDFS - Hadoop Distributed File System
○ Provides a distributed data storage system that stores data in smaller
blocks in a fail-safe manner
● MapReduce - programming framework
○ Can take a query over a dataset, divide it, and run it in
parallel on multiple nodes
● YARN - (Yet Another Resource Negotiator) MRv2
○ Splits the MapReduce JobTracker’s responsibilities into:
■ Resource Manager (global)
■ Application Master (per application)
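The MapReduce model above can be sketched in a few lines of plain Python; this is a single-process simulation of the map, shuffle, and reduce phases (not actual Hadoop code), counting words across input splits.

```python
from collections import defaultdict

def map_phase(document):
    # Mapper: emit a (word, 1) pair for every word in one input split.
    return [(word.lower(), 1) for word in document.split()]

def shuffle(mapped):
    # Shuffle/sort: group all emitted values by key across all mappers.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: aggregate each key's values; keys are independent,
    # which is what lets Hadoop run reducers in parallel.
    return {key: sum(values) for key, values in groups.items()}

splits = ["big data big problems", "big wins"]
mapped = [pair for doc in splits for pair in map_phase(doc)]
counts = reduce_phase(shuffle(mapped))
# counts["big"] == 3
```

In real Hadoop, each split lives on a different HDFS node and the mapper runs where its block is stored, which is exactly the "move the computation to where the data is" principle.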
23. New Age BI Applications
● Able to understand various types of data
● Ability to clean the data
● Process data with applied rules locally and in distributed environment
● Visualize sizeable data with speed
● Extend results by sharing within the enterprise
24. Big Data Analytics
● Processing large amounts of data without data movement
● Avoid data connectors if possible (run natively)
● Ability to understand vast amounts of data types and data
compressions
● Ability to process data on variety of processing frameworks
● Distributed data processing
○ In-Memory a big plus
● Super fast visualization
○ In-Memory a big plus
25. When to choose Hadoop?
● Large volumes of data to store and process
● Semi-Structured or Unstructured data
● Data is not well categorized
● Data contains a lot of redundancy
● Data arrives in streams or large batches
● Complex batch jobs arriving in parallel
● You don’t know how the data might be useful
26. Distributed Systems => Problems
https://imgflip.com/i/1ap5kr
http://kingofwallpapers.com/otter/otter-004.jpg
27. Monolith Structure
OS CPU Memory Disk
Processes Java
Application
Server
Database
Web Server
Load
Balancer
Users - Other Applications
Monitoring
System
UI
Many times...all of this was on a single physical server!
32. Problems
● Multiple physical servers
● Multiple logical services
● Want Scaling => More Servers
● Even if you had all of the metrics
○ You’ll have an overflow of the data
● Your monitoring becomes a “Big Data” problem itself
33. This is what “Distributed” really Means
The DevOps Guy
(It might be you)
43. Ways to Monitor Spark
● Grafana-spark-dashboards
○ Blog:
http://www.hammerlab.org/2015/02/27/monitoring-spark-with-graphite-and-grafana/
● Spark UI - Online on each application running
● Spark History Server - Offline (After application finishes)
● Spark REST API
○ Querying via inner tools to do ad-hoc monitoring
● Back to the basics: dstat, iostat, iotop, jstack
● Blog post by Tzach Zohar - “Tips from the Trenches”
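As one hedged example of the "Spark REST API" option above: the Spark UI (port 4040 on the driver by default) exposes a monitoring REST API, and `/api/v1/applications` lists applications as JSON. The helper names below are my own; only the endpoint comes from Spark's monitoring docs.

```python
import json
from urllib.request import urlopen  # stdlib only

# Default Spark UI REST endpoint on the driver (assumed host/port).
SPARK_API = "http://localhost:4040/api/v1/applications"

def summarize_apps(payload):
    """Reduce the applications JSON payload to (id, name) pairs
    for quick ad-hoc monitoring checks."""
    return [(app["id"], app["name"]) for app in json.loads(payload)]

def fetch_apps(url=SPARK_API):
    # Would hit a live driver; kept separate so parsing is testable offline.
    with urlopen(url) as resp:
        return summarize_apps(resp.read().decode())
```

Pointing the same parsing at the History Server's REST endpoint gives you the offline (post-run) view mentioned above.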
45. Data Questions - What should we measure?
● Did all of the computation occur?
○ Are there any data layers missing?
● How much data do we have? (Volume)
● Is all of the data in the Database?
● Data Quality Assurance
46. Data Answers!
● The method doesn’t really matter, as long as you:
○ Can follow the results over time
○ Know your data flow and know what might fail
○ Make it easy for anyone to add more monitoring
(for the ones that add the new data each time…)
○ Don’t trust others to add the monitoring for you
(it will always end up being the DevOps guy’s “fault” -> no monitoring will be
applied)
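A minimal sketch of the "are there any data layers missing?" check described above: count which layers reported data in each time bucket and flag the gaps. The layer names are hypothetical placeholders, not from the talk.

```python
# Hypothetical pipeline layers we expect to see for every time bucket.
EXPECTED_LAYERS = {"raw", "enriched", "aggregated"}

def missing_layers(records):
    """records: iterable of (time_bucket, layer) pairs observed in storage.
    Returns {time_bucket: set of layers that never showed up}."""
    seen = {}
    for bucket, layer in records:
        seen.setdefault(bucket, set()).add(layer)
    return {bucket: EXPECTED_LAYERS - layers
            for bucket, layers in seen.items()
            if EXPECTED_LAYERS - layers}

records = [("10:00", "raw"), ("10:00", "enriched"), ("10:00", "aggregated"),
           ("11:00", "raw")]
gaps = missing_layers(records)
# 11:00 is missing the enriched and aggregated layers
```

The output of a check like this is exactly what you'd graph over time, so anyone adding a new data layer only has to add its name to the expected set.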
50. Big Data - Are we there yet?
● The “3 Vs” - what are the right questions we want to ask?
○ Volume - How much?
■ Can it run on a single machine in reasonable time?
○ Velocity - How fast?
■ Can a single machine handle the throughput?
○ Variety - What kind? (Difference)
■ Is your data mostly uniform, without much variety?
● If the answer to most of the previous questions is “Yes”,
think again about whether you want to add the complexity of “Big Data”
51. Conclusions
● Think carefully before going into the “Big Data pool”
○ See if you really have a problem that you’re trying to solve
○ It’s not a silver bullet
● Take measures to automate and monitor everything
● Having Clusters and distributed frameworks will cost a lot - eventually
● Fit your storage layer(s) to your needs