Bucketing 2.0: Improve Spark SQL Performance by Removing Shuffle (Databricks)
Bucketing is commonly used in Hive and Spark SQL to improve performance by eliminating the shuffle in join and group-by-aggregate scenarios. This is ideal for a variety of write-once, read-many datasets at ByteDance.
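The reason bucketing removes the shuffle can be sketched in plain Python. This is only an illustration of the idea, not Spark's or Hive's implementation: the function names, the bucket count, and the sample rows are all hypothetical. Both tables are hashed on the join key into the same number of buckets when they are written, so at read time bucket i of one table only ever needs to meet bucket i of the other.

```python
NUM_BUCKETS = 4  # hypothetical bucket count; both tables must use the same value


def bucketize(rows, key, num_buckets=NUM_BUCKETS):
    """Write-time step: place each row in the bucket chosen by hashing its join key."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[hash(row[key]) % num_buckets].append(row)
    return buckets


def bucketed_join(left_buckets, right_buckets, key):
    """Read-time step: join bucket i of one table only with bucket i of the other.

    No row ever has to move to a different bucket, which is the work a
    shuffle would otherwise do.
    """
    joined = []
    for left, right in zip(left_buckets, right_buckets):
        # Build a small lookup table for this bucket pair only.
        index = {}
        for row in right:
            index.setdefault(row[key], []).append(row)
        for row in left:
            for match in index.get(row[key], []):
                joined.append({**row, **match})
    return joined


# Hypothetical sample data.
orders = [
    {"user_id": 1, "item": "book"},
    {"user_id": 2, "item": "pen"},
    {"user_id": 1, "item": "lamp"},
]
users = [{"user_id": 1, "name": "Ada"}, {"user_id": 2, "name": "Lin"}]

result = bucketed_join(
    bucketize(orders, "user_id"), bucketize(users, "user_id"), "user_id"
)
for row in result:
    print(row)
```

The bucketing cost is paid once at write time, which is why the approach suits datasets that are written once and read many times.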
Parquet performance tuning: the missing guide (Ryan Blue)
Ryan Blue explains how Netflix is building on Parquet to enhance its 40+ petabyte warehouse, combining Parquet’s features with Presto and Spark to boost ETL and interactive queries. Information about tuning Parquet is hard to find. Ryan shares what he’s learned, creating the missing guide you need.
Topics include:
* The tools and techniques Netflix uses to analyze Parquet tables
* How to spot common problems
* Recommendations for Parquet configuration settings to get the best performance out of your processing platform
* The impact of this work in speeding up applications like Netflix’s telemetry service and A/B testing platform
Tuning Apache Spark for Large-Scale Workloads, Gaoxiang Liu and Sital Kedia (Databricks)
Apache Spark is a fast and flexible compute engine for a variety of diverse workloads. Optimizing performance for different applications often requires an understanding of Spark internals and can be challenging for Spark application developers. In this session, learn how Facebook tunes Spark to run large-scale workloads reliably and efficiently. The speakers will begin by explaining the various tools and techniques they use to discover performance bottlenecks in Spark jobs. Next, you’ll hear about important configuration parameters and their experiments tuning these parameters on large-scale production workloads. You’ll also learn about Facebook’s new efforts towards automatically tuning several important configurations based on the nature of the workload. The speakers will conclude by sharing their results with automatic tuning and future directions for the project.
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021 (StreamNative)
You may be familiar with the Presto plugin used to run fast interactive ANSI SQL queries over Pulsar, whose data can then be joined with other data sources. This plugin will soon be renamed to align with the renaming of the PrestoSQL project to Trino. What is the purpose of this rename, and what does it mean for those using the Presto plugin? We cover the history of the community shift from PrestoDB to PrestoSQL, as well as the Pulsar community's future plans to donate this plugin to the Trino project. One of the connector maintainers will then demo the connector and show what is possible when using Trino and Pulsar!
Parallelization of Structured Streaming Jobs Using Delta Lake (Databricks)
We’ll tackle the problem of running streaming jobs from another perspective using Databricks Delta Lake, while examining some of the issues we faced at Tubi when running regular Structured Streaming jobs. We’ll also give a quick overview of why we transitioned from Parquet data files to Delta and the problems that solved for us in running our streaming jobs.
Change Data Feed is a new feature of Delta Lake on Databricks that is available as a public preview since DBR 8.2. This feature enables a new class of ETL workloads such as incremental table/view maintenance and change auditing that were not possible before. In short, users will now be able to query row level changes across different versions of a Delta table.
In this talk we will dive into how Change Data Feed works under the hood and how to use it with existing ETL jobs to make them more efficient and also go over some new workloads it can enable.
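To make "row-level changes across versions" concrete, here is a minimal plain-Python sketch of what such a change listing looks like. It is an assumption-laden illustration only: it diffs two in-memory snapshots, which is not a description of how Delta Lake computes or stores its change feed, and the function name, change labels, and sample rows are hypothetical.

```python
def row_level_changes(old_version, new_version):
    """Compare two table snapshots (dicts keyed by primary key).

    Returns a list of (change_type, key, row) tuples, ordered by key.
    """
    changes = []
    for key in sorted(old_version.keys() | new_version.keys()):
        before = old_version.get(key)
        after = new_version.get(key)
        if before is None:
            changes.append(("insert", key, after))
        elif after is None:
            changes.append(("delete", key, before))
        elif before != after:
            changes.append(("update", key, after))
        # Rows that are identical in both versions produce no change record.
    return changes


# Hypothetical snapshots of the same table at two versions.
v1 = {1: {"name": "Ada", "plan": "free"}, 2: {"name": "Lin", "plan": "pro"}}
v2 = {1: {"name": "Ada", "plan": "pro"}, 3: {"name": "Sam", "plan": "free"}}

changes = row_level_changes(v1, v2)
for change in changes:
    print(change)
```

An incremental ETL job in this style would apply only the returned changes downstream instead of reprocessing the whole table.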
Building real time analytics applications using pinot: A LinkedIn case study (Kishore Gopalakrishna)
LinkedIn is the most advantageous social networking tool available to job seekers and business professionals today, with 610+ million members creating millions of posts, videos, and articles that generate tens of millions of shares, comments, and likes per day. LinkedIn has leveraged this activity data to build rich interactive user-facing analytics applications like “Who Viewed My Profile”, Talent Insights, Ad Analytics, and Publisher Analytics, among others. These applications are all powered by Pinot, as are internal dashboards and anomaly detection and root cause analysis platforms like ThirdEye. This talk will present how Pinot has become the de facto solution for serving analytic queries in milliseconds, ad-hoc reporting, monitoring, and anomaly detection on multidimensional data.
Fine Tuning and Enhancing Performance of Apache Spark Jobs (Databricks)
Apache Spark defaults provide decent performance for large data sets but leave room for significant performance gains if you tune parameters based on your resources and job.
Monitor Apache Spark 3 on Kubernetes using Metrics and Plugins (Databricks)
This talk will cover some practical aspects of Apache Spark monitoring, focusing on measuring Apache Spark running on cloud environments, and aiming to empower Apache Spark users with data-driven performance troubleshooting. Apache Spark metrics allow extracting important information on Apache Spark’s internal execution. In addition, Apache Spark 3 has introduced an improved plugin interface extending the metrics collection to third-party APIs. This is particularly useful when running Apache Spark on cloud environments as it allows measuring OS and container metrics like CPU usage, I/O, memory usage, network throughput, and also measuring metrics related to cloud filesystems access. Participants will learn how to make use of this type of instrumentation to build and run an Apache Spark performance dashboard, which complements the existing Spark WebUI for advanced monitoring and performance troubleshooting.
Delta Lake is an open-source innovation that brings new capabilities for transactions, version control, and indexing to your data lakes. We uncover how Delta Lake benefits you and why it matters. In this session, we showcase some of its benefits and how they can improve your modern data engineering pipelines. Delta Lake provides snapshot isolation, which helps with concurrent read/write operations and enables efficient insert, update, delete, and rollback capabilities. It allows background file optimization through compaction and Z-order partitioning, achieving better performance. In this presentation, we will learn about Delta Lake's benefits, how it solves common data lake challenges, and, most importantly, the new Delta Time Travel capability.
The Summer 2016 release of Informatica Cloud is packed with many new platform features including:
- Cloud Data Integration Hub that supports publish and subscribe integration patterns that automate and streamline integration across cloud and on-premise sources
- Innovative features like stateful time sensitive variables, and advanced data transformations like unions and sequences
- Intelligent and dynamic data masking of sensitive data to save development and QA time.
- Cloud B2B Gateway is the leading data exchange platform for enterprises and their partners and customers, providing end-to-end data monitoring capabilities and support for the highest level of data quality.
- Enhancements to native connectors for popular cloud applications like Workday, SAP Success Factors, Oracle, SugarCRM, MongoDB, Teradata Cloud, SAP Concur, Salesforce Financial Services Cloud
And much more!
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc... (Databricks)
Spark SQL is a highly scalable and efficient relational processing engine with easy-to-use APIs and mid-query fault tolerance. It is a core module of Apache Spark. Spark SQL can process, integrate, and analyze data from diverse data sources (e.g., Hive, Cassandra, Kafka, and Oracle) and file formats (e.g., Parquet, ORC, CSV, and JSON). This talk will dive into the technical details of Spark SQL, spanning the entire lifecycle of a query execution. The audience will get a deeper understanding of Spark SQL and learn how to tune Spark SQL performance.
Tomer Shiran is the founder and Chief Product Officer (CPO) of Dremio. Tomer was the fourth employee and VP of Product at MapR, a pioneer in Big Data analytics. He has also held numerous product management and engineering positions at IBM Research and Microsoft, and founded several websites that have served millions of users. He holds a Master's degree in Computer Engineering from Carnegie Mellon University and a Bachelor of Science in Computer Science from the Technion - Israel Institute of Technology.
The Modern Data Stack Meetup is delighted to welcome Tomer Shiran. From Apache Drill and Apache Arrow to, now, Apache Iceberg, he and his teams have anchored Dremio's choices in a vision of an “open” data platform based on open source technologies. Beyond these values, which keep customers from being locked into proprietary formats, he is also mindful of the costs such platforms incur. He also brings a number of features that transform data management, through initiatives such as Nessie, which paves the way for Data as Code and multi-process transactions.
The Modern Data Stack Meetup gives Tomer Shiran “carte blanche” to share his experience and his vision of the Open Data Lakehouse.
This presentation about HBase will help you understand what HBase is, what its applications are, how HBase differs from an RDBMS, what HBase storage is, and what the architectural components of HBase are. At the end, we will also look at some HBase commands in a demo. HBase is an essential part of the Hadoop ecosystem. It is a column-oriented database management system derived from Google’s NoSQL database Bigtable that runs on top of HDFS. After watching this video, you will know how to store and process large datasets using HBase. Now, let us get started and understand HBase and what it is used for.
The following topics are explained in this HBase presentation:
1. What is HBase?
2. HBase Use Case
3. Applications of HBase
4. HBase vs RDBMS
5. HBase Storage
6. HBase Architectural Components
What is this Big Data Hadoop training course about?
Simplilearn’s Big Data Hadoop training course lets you master the concepts of the Hadoop framework and prepares you for Cloudera’s CCA175 Big Data certification. The Big Data Hadoop and Spark developer course has been designed to impart in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
This course will enable you to:
1. Understand the different components of the Hadoop ecosystem such as Hadoop 2.7, Yarn, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark
2. Understand Hadoop Distributed File System (HDFS) and YARN as well as their architecture, and learn how to work with them for storage and resource management
3. Understand MapReduce and its characteristics, and assimilate some advanced MapReduce concepts
4. Get an overview of Sqoop and Flume and describe how to ingest data using them
5. Create database and tables in Hive and Impala, understand HBase, and use Hive and Impala for partitioning
6. Understand different types of file formats, Avro Schema, using Avro with Hive, and Sqoop and Schema evolution
7. Understand Flume, Flume architecture, sources, flume sinks, channels, and flume configurations
8. Understand HBase, its architecture, data storage, and working with HBase. You will also understand the difference between HBase and RDBMS
9. Gain a working knowledge of Pig and its components
10. Do functional programming in Spark
11. Understand resilient distributed datasets (RDD) in detail
12. Implement and build Spark applications
13. Gain an in-depth understanding of parallel processing in Spark and Spark RDD optimization techniques
14. Understand the common use-cases of Spark and the various interactive algorithms
15. Learn Spark SQL, creating, transforming, and querying Data frames
Learn more at https://www.simplilearn.com/big-data-and-analytics/big-data-and-hadoop-training
Scaling Data Analytics Workloads on Databricks (Databricks)
Imagine an organization with thousands of users who want to run data analytics workloads. These users shouldn’t have to worry about provisioning instances from a cloud provider, deploying a runtime processing engine, scaling resources based on utilization, or ensuring their data is secure. Nor should the organization’s system administrators.
In this talk we will highlight some of the exciting problems we’re working on at Databricks in order to meet the demands of organizations that are analyzing data at scale. In particular, data engineers attending this session will walk away having learned how we:
* Manage a typical query lifetime through the Databricks software stack
* Dynamically allocate resources to satisfy the elastic demands of a single cluster
* Isolate the data and the generated state within a large organization with multiple clusters
Engineering patterns for implementing data science models on big data platforms (Hisham Arafat)
A discussion of the practicalities of implementing data science models on big data platforms from an engineering perspective, and an eye-opener on the engineering factors involved in designing a working solution. We use a simple text mining example from social media analytics for brand marketing. At first it seems a simple solution; however, if you dig deeper and think through the implementation aspects of even a simple analytics model, you discover the degree of complexity in each part of the solution. An abstraction of the key advantages of Big Data helps in selecting appropriate Big Data technology components from a very large landscape. Two examples with references are given: one using the Lambda Architecture, and one showing an unusual way of processing images using the Big Data abstraction provided.
As part of this session, I will give an introduction to Data Engineering and Big Data, covering up-to-date trends.
* Introduction to Data Engineering
* Role of Big Data in Data Engineering
* Key Skills related to Data Engineering
* Overview of Data Engineering Certifications
* Free Content and ITVersity Paid Resources
Don't worry if you miss the live session: you can use the link below to watch the video afterwards.
https://youtu.be/dj565kgP1Ss
* Upcoming Live Session - Overview of Big Data Certifications (Spark Based) - https://www.meetup.com/itversityin/events/271739702/
Relevant Playlists:
* Apache Spark using Python for Certifications - https://www.youtube.com/playlist?list=PLf0swTFhTI8rMmW7GZv1-z4iu_-TAv3bi
* Free Data Engineering Bootcamp - https://www.youtube.com/playlist?list=PLf0swTFhTI8pBe2Vr2neQV7shh9Rus8rl
* Join our Meetup group - https://www.meetup.com/itversityin/
* Enroll for our labs - https://labs.itversity.com/plans
* Subscribe to our YouTube Channel for Videos - http://youtube.com/itversityin/?sub_confirmation=1
* Access Content via our GitHub - https://github.com/dgadiraju/itversity-books
* Lab and Content Support using Slack
H2O Deep Water - Making Deep Learning Accessible to Everyone (Sri Ambati)
Deep Water is H2O's integration with multiple open source deep learning libraries such as TensorFlow, MXNet, and Caffe. On top of the performance gains from GPU backends, Deep Water naturally inherits all of H2O's properties in scalability, ease of use, and deployment. In this talk, I will go through the motivation and benefits of Deep Water. After that, I will demonstrate how to build and deploy deep learning models with or without programming experience using H2O's R/Python/Flow (Web) interfaces.
Jo-fai (or Joe) is a data scientist at H2O.ai. Before joining H2O, he was in the business intelligence team at Virgin Media in UK where he developed data products to enable quick and smart business decisions. He also worked remotely for Domino Data Lab in the US as a data science evangelist promoting products via blogging and giving talks at meetups. Joe has a background in water engineering. Before his data science journey, he was an EngD research engineer at STREAM Industrial Doctorate Centre working on machine learning techniques for drainage design optimization. Prior to that, he was an asset management consultant specialized in data mining and constrained optimization for the utilities sector in the UK and abroad. He also holds an MSc in Environmental Management and a BEng in Civil Engineering.
Simplifying Building Automation: Leveraging Semantic Tagging with a New Breed... (Memoori)
Memoori's 10th Webinar in the 2019 Smart Buildings Series. We spoke with Chris Irwin, VP Sales EMEA & Asia at J2 Innovations about the FIN 5 software framework and “Simplifying Building Automation by Leveraging Semantic Tagging with a New Breed of Software”.
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ... (Big Data Spain)
http://www.bigdataspain.org/2014/conference/state-of-play-data-science-on-hadoop-in-2015-keynote
Machine Learning is not new. Big Machine Learning is qualitatively different: more data beats algorithm improvement, scale trumps noise and sample size effects, and manual tasks can be brute-forced.
Session presented at Big Data Spain 2014 Conference
18th Nov 2014
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Slides: https://speakerdeck.com/bigdataspain/state-of-play-data-science-on-hadoop-in-2015-by-sean-owen-at-big-data-spain-2014
ABSTRACT: The ongoing big data revolution has changed the way in which technology is used to empower new business segments like social networking and transform old business segments like traditional retail. However, the DNA that is used to build data processing platforms is evolving quite rapidly. There is a plethora of competing tools, technologies, and “religions” for how to build state-of-the-art data analysis frameworks. In this talk, I will go over five ways to build scalable, high-performance, long-lasting data analysis frameworks in the wrong way. Surprisingly, the industry is full of examples of organizations building frameworks in this “wrong” way. Since the “right” way to build a technology framework is dependent on the key business drivers, it is my hope that this talk will spur a discussion on what is the “right” way for Pinterest. The talk will focus on technologies including “data plumbing” (e.g. tools in the Hadoop ecosystem), and statistical modeling methods (e.g. R and Python). In this talk, I’ll try to connect to platform builders, data scientists, and business decision makers.
BIO: Jignesh Patel is a Professor in Computer Sciences at the University of Wisconsin-Madison, where he also earned his Ph.D. He has worked in the area of databases (now fashionably called “big data”) for over two decades. He has won several best paper awards, and industry research awards. He is the recipient of the Wisconsin COW teaching award, and the U. Michigan College of Engineering Education Excellence Award. He has a strong interest in seeing research ideas transition to actual products. His Ph.D. thesis work was acquired by NCR/Teradata in 1997, and he also co-founded Locomatix -- a startup that built a platform to power real-time data-driven mobile services. Locomatix became part of Twitter in 2013. He is an ACM Distinguished Scientist and an IEEE Senior Member. He also serves on the board of Lands’ End, and advises a number of startups.
Agile Data Rationalization for Operational Intelligence Inside Analysis
The Briefing Room with Eric Kavanagh and Phasic Systems
Live Webcast Mar. 26, 2013
The complexity of today's information architectures creates a wide range of challenges for executives trying to get a strategic view of their current operations. The data and context locked in operational systems often get diluted during the normalization processes of data warehousing and other types of analytic solutions. And the ultimate goal of seeing the big picture gets derailed by a basic inability to reconcile disparate organizational views of key information assets and rules.
Register for this episode of The Briefing Room to learn from Bloor Group CEO Eric Kavanagh, who will explain how a tightly controlled methodology can be combined with modern NoSQL technology to resolve both process and system complexities, thus enabling a much richer, more interconnected information landscape. Kavanagh will be briefed by Geoffrey Malafsky of Phasic Systems who will share his company's tested methodology for capturing and managing the business and process logic that run today's data-driven organizations. He'll demonstrate how a “don't say no” approach to entity definitions can dissolve previously intractable disagreements, opening the door to clear, verifiable operational intelligence.
Visit: http://www.insideanalysis.com
Making good command decisions today is more and more being underpinned by the use of data and the insights that the data can deliver. In our rapidly changing world, we start to find that the pure volume of data becomes overwhelming. This volume of data can lead to indecision instead of better decision making. In this session, we will cover how through the use of artificial intelligence, machine learning and intelligent data routing we can enhance and support the decision-making process in times of crisis.
What connects BMW’s ultimate driving machines and IoT? Take a look at what went down at HARMAN Connected Services’ event at a BMW performance track, and understand how Data, Device and Design, the three key dimensions of disruption, are revolutionizing different industries.
Testing Strategies to Deliver Consistent App Performance HARMAN Services
Stop gambling with your application performance. Know how continuous testing processes and strategies can help you deliver better app performance during Grand National and Seasonal spikes.
How to Manage APIs in your Enterprise for Maximum Reusability and Governance HARMAN Services
Trying to form an API/service strategy to keep pace with the IoT revolution? Learn about the issues and challenges your enterprise might face while implementing it, and how to address them.
This webinar also explains how WSO2 API Manager and WSO2 Governance Registry have helped enterprises overcome the following challenges:
1. How the number of services and their users affect service discoverability, catalog, and re-usability.
2. Mistrust among producers and consumers
3. Reliability, stability, and availability of services
4. How externally built common and reusable services meet requirements (anti-patterns - NIH)
Webinar - Transforming Manufacturing with IoT HARMAN Services
The manufacturing industry is realizing the tremendous benefits of the “Internet of Things” (IoT), an inevitable evolution of traditional M2M solutions. Innovations across embedded devices, advanced analytics, and enriched user experiences, all powered by the cloud, have enabled new opportunities for both perpetual revenue and perpetual customer value. In this session we will break down the benefits of IoT for manufacturing with real-world examples.
How enterprises in the travel business are successfully navigating their digital transformation strategy and interacting with their customers across every touch-point.
The expectations of the digital customer are rising, how are you keeping up with it? Check out these 3 power moves which all the CEOs in the media industry are using to navigate digital transformation.
Ladbrokes and Aditi - Digital Transformation Case study HARMAN Services
Your digital customer is evolving and digital engagement is evolving even faster! See how Aditi digitally transformed Ladbrokes' business to give them an edge over the competition.
How Internet of Things (IoT) is Reshaping the Automotive Sector - Infographic HARMAN Services
"The cars from science fiction movies are becoming reality. Customers are waiting to discover their own car-based “Siri” or “Cortana” or dare we say “Jarvis” (for the inner Iron Man in them).
In a software-defined world, Silicon Valley seems to be on par with Detroit in innovating car-based experiences through the Internet of Things (IoT). Starting with Google and its pioneering effort in rewriting the rules of driverless cars, connected automobiles are steadily making their way into the automotive market.
And the future is set for IoT-connected cars to lead and thrive in this innovation-hungry world. Parks Associates estimates that a whopping 78% of car owners will demand connected features in their next vehicle.
We created an infographic which lucidly shows how connected cars are going to shape the future of driving and transport. Check it out!"
Analyzing Gartner's CIO Study: Flipping to Digital Leadership HARMAN Services
Why CIOs Need To “Flip” From Old To New In Terms Of Information And Technology Leadership, Value Leadership And People Leadership. We analyzed and decoded Gartner's latest report to find out more!
24 Connected Car features to look out for before the release of Bond 24 HARMAN Services
What if connected cars and the Internet of Things were cast in the movie 'Bond 24'? We did some investigating of our own and came up with a list of 24 cool connected car features worthy of our favorite spy.
Your customers are more connected than ever and they are interacting with your brand across channels. Find out how you can implement an OMNI-channel engagement strategy with cloud and big-data.
Customer Experience Trumps Everything. What happens when you put 100 UX designers in a room and scout for ideas? Take our word, it’s an ‘experience’ of a lifetime.
Accelerate your Kubernetes clusters with Varnish Caching Thijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
PHP Frameworks: I want to break free (IPC Berlin 2024) Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
Search and Society: Reimagining Information Access for Radical Futures Bhaskar Mitra
The field of Information retrieval (IR) is currently undergoing a transformative shift, at least partly due to the emerging applications of generative AI to information access. In this talk, we will deliberate on the sociotechnical implications of generative AI for information access. We will argue that there is both a critical necessity and an exciting opportunity for the IR community to re-center our research agendas on societal needs while dismantling the artificial separation between the work on fairness, accountability, transparency, and ethics in IR and the rest of IR research. Instead of adopting a reactionary strategy of trying to mitigate potential social harms from emerging technologies, the community should aim to proactively set the research agenda for the kinds of systems we should build inspired by diverse explicitly stated sociotechnical imaginaries. The sociotechnical imaginaries that underpin the design and development of information access technologies needs to be explicitly articulated, and we need to develop theories of change in context of these diverse perspectives. Our guiding future imaginaries must be informed by other academic fields, such as democratic theory and critical theory, and should be co-developed with social science scholars, legal scholars, civil rights and social justice activists, and artists, among others.
Epistemic Interaction - tuning interfaces to provide information for AI support Alan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Essentials of Automations: Optimizing FME Workflows with Parameters Safe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Let's dive deeper into the world of ODC! Ricardo Alves (OutSystems) will join us to tell all about the new Data Fabric. After that, Sezen de Bruijn (OutSystems) will get into the details on how to best design a sturdy architecture within ODC.
JMeter webinar - integration with InfluxDB and Grafana RTTS
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality Inflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
UiPath Test Automation using UiPath Test Suite series, part 4 DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
1. Learning and Development presents
Open Talk Series
A series of illuminating talks and interactions that open our minds to new ideas and concepts; that make us look for newer or better ways of doing what we did; or point us to exciting things we have never done before. A range of topics on Technology, Business, Fun and Life.
Be part of the learning experience at Aditi.
Join the talks. It's free.
Free as in freedom at work, not free beer.
It's not training. It's a mind-opener.
Speak at these events. Or bring an expert/friend to talk.
Mail OpenTalk@aditi.com with topic and availability.
2. HOW TO ENJOY A TALK
Bring coffee & friends
Switch OFF mobile
Switch ON mind
Sign attendance sheet
SHARE your wisdom
QUESTION notions
THANK the Talker
SPREAD the good word
4. facebook in 20 Minutes
• 2.7M Photos
• 10.2M Comments
• 4.6M Messages
Statistics
What is Facebook
Technical challenges
Front End
Data arch
Services architecture
• Shared links: 1,000,000
• Tagged photos: 1,323,000
• Event invites sent out: 1,484,000
• Wall Posts: 1,587,000
• Status updates: 1,851,000
• Friend requests accepted: 1,972,000
• Photos uploaded: 2,716,000
• Comments: 10,208,000
• Messages: 4,632,000
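To put the figures above in operational terms, they can be converted to per-second rates. The sketch below is in Python and assumes the counts are totals for one 20-minute window, as the slide title suggests; the dictionary keys and the `per_second` helper are illustrative names, not part of the talk.

```python
# Activity counts copied from the slide; assumed to be totals per 20-minute window.
PER_20_MIN = {
    "shared_links": 1_000_000,
    "tagged_photos": 1_323_000,
    "event_invites": 1_484_000,
    "wall_posts": 1_587_000,
    "status_updates": 1_851_000,
    "friend_requests_accepted": 1_972_000,
    "photos_uploaded": 2_716_000,
    "comments": 10_208_000,
    "messages": 4_632_000,
}

WINDOW_SECONDS = 20 * 60  # 1,200 seconds


def per_second(counts, window_seconds=WINDOW_SECONDS):
    """Convert per-window totals into average events per second."""
    return {name: count / window_seconds for name, count in counts.items()}


rates = per_second(PER_20_MIN)
for name, rate in sorted(rates.items(), key=lambda item: -item[1]):
    print(f"{name}: about {rate:,.0f} per second")
```

Under that assumption, comments alone average several thousand writes per second, which is the scale the later slides on caching and partitioning respond to.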
5. facebook in 20 Minutes
Direct Friendship
Friends of Friends
6. What is facebook
• A social graph
• Friends, friends of friends, somewhere in the network.
• Friends can comment, like, read your posts
• Friends of friends can just read
• Facebook messages – chat/email/SMS
• Near real-time updates
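The access rules on this slide (friends can comment, like and read; friends of friends can only read) can be sketched as a lookup on an adjacency map. This is a minimal Python illustration, not Facebook's implementation; the `Relation` enum, the function names and the rule that everyone else gets no access are assumptions made for the example.

```python
from enum import Enum


class Relation(Enum):
    SELF = 0
    FRIEND = 1
    FRIEND_OF_FRIEND = 2
    STRANGER = 3


def relation(graph, viewer, owner):
    """Classify viewer relative to owner; graph maps user id -> set of friend ids."""
    if viewer == owner:
        return Relation.SELF
    friends = graph.get(owner, set())
    if viewer in friends:
        return Relation.FRIEND
    # A friend of a friend is anyone adjacent to one of the owner's friends.
    if any(viewer in graph.get(friend, set()) for friend in friends):
        return Relation.FRIEND_OF_FRIEND
    return Relation.STRANGER


def allowed_actions(graph, viewer, owner):
    """Return the actions the viewer may perform on the owner's posts."""
    rel = relation(graph, viewer, owner)
    if rel in (Relation.SELF, Relation.FRIEND):
        return {"read", "like", "comment"}
    if rel is Relation.FRIEND_OF_FRIEND:
        return {"read"}
    return set()  # assumption: users further away see nothing


graph = {1: {2}, 2: {1, 3}, 3: {2}, 4: set()}
print(allowed_actions(graph, viewer=3, owner=1))  # user 3 is a friend of a friend of user 1
```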
7. Technical Challenges
Challenges
• High Concurrency
• High Data Volumes
• Multilevel Hierarchical data
Ok to Live with
• Not Mission Critical
• Cached data is fine
• Write Failures are tolerable
8. The Data – (Illustrational)
Everything is a hash lookup
User ID | Friends with | User Name | Age | Bio | Interests
1 | 2,3,4 | XYZ | .. | .. | ..
2 | 1 | .. | .. | .. | ..
Challenges | Solutions
The Relational Nature of the data | No Constraints, No Joins in MySQL
Data Volumes | Write Through cache implementation
Concurrency | Hash Ring based architecture
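A "hash ring" is commonly implemented as consistent hashing: each storage node is placed at many points on a ring, and a key belongs to the first node clockwise from the key's hash. The Python sketch below shows that general technique; the talk does not say how Facebook built its ring, and the class name, the use of MD5, the node names and the number of virtual points per node are all assumptions.

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring with virtual points (illustrative sketch)."""

    def __init__(self, nodes, replicas=100):
        ring = []
        for node in nodes:
            # Several virtual points per node spread keys more evenly.
            for i in range(replicas):
                ring.append((self._hash(f"{node}#{i}"), node))
        ring.sort()
        self._points = [point for point, _ in ring]
        self._owners = [owner for _, owner in ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def node_for(self, key):
        """Return the node owning key: first ring point at or after its hash."""
        idx = bisect.bisect(self._points, self._hash(key)) % len(self._points)
        return self._owners[idx]


ring = HashRing(["db1", "db2", "db3"])
print(ring.node_for("user:1"))
```

The useful property is that removing a node only remaps the keys that node owned; every other key keeps its placement, so caches on the surviving nodes stay warm.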
9. facebook – Data Partition initial thoughts
• Horizontal partitioning based on Networks.
– Harvard
– Stanford
– Carnegie
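Partitioning by network amounts to grouping users by the network they belong to. The Python sketch below illustrates the idea and one likely reason it is labelled "initial thoughts": a friendship between users in different networks spans two partitions. The record layout and function names are hypothetical, and the cross-partition observation is an inference, not a statement from the slides.

```python
from collections import defaultdict


def partition_by_network(users):
    """Group user ids by network; each group would live on its own partition."""
    partitions = defaultdict(list)
    for user in users:
        partitions[user["network"]].append(user["id"])
    return dict(partitions)


def crosses_partitions(user_a, user_b):
    """True when a friendship between the two users would span two partitions."""
    return user_a["network"] != user_b["network"]


users = [
    {"id": 1, "network": "Harvard"},
    {"id": 2, "network": "Stanford"},
    {"id": 3, "network": "Harvard"},
    {"id": 4, "network": "Carnegie"},
]
print(partition_by_network(users))
print(crosses_partitions(users[0], users[1]))
```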
10. facebook – Photos – HayStack
• Each file read required a minimum of 3 I/Os in a typical file system
• CDNs – not a solution
• Haystack is a customized storage system, which minimizes the amount of file metadata and involves only 1 I/O for each file read.
• Haystack caches extensive data in its main memory
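The idea behind the single I/O can be sketched as one large append-only volume plus an in-memory index from photo id to (offset, size): a read is then one seek and one read, with no per-file metadata lookups on disk. This Python sketch is a simplification under that assumption; the `NeedleStore` name and its methods are hypothetical and omit everything a real store needs (checksums, deletion, recovery).

```python
import io


class NeedleStore:
    """Append-only blob volume with an in-memory offset index (illustrative)."""

    def __init__(self, volume):
        self._volume = volume  # any seekable binary file-like object
        self._index = {}       # photo_id -> (offset, size), kept in main memory

    def put(self, photo_id, data):
        """Append the photo bytes to the volume and remember where they landed."""
        self._volume.seek(0, io.SEEK_END)
        offset = self._volume.tell()
        self._volume.write(data)
        self._index[photo_id] = (offset, len(data))

    def get(self, photo_id):
        """Read a photo with one seek and one read, using only the memory index."""
        offset, size = self._index[photo_id]
        self._volume.seek(offset)
        return self._volume.read(size)


store = NeedleStore(io.BytesIO())  # BytesIO stands in for a large file on disk
store.put(1, b"first-photo")
store.put(2, b"second-photo")
print(store.get(2))
```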
11. facebook – HayStack
HayStack Interface
HayStack Cache
HayStack Directory
Logical Drives
Logical Drives
PD PD PD PD PD PD
http://CDN/Cache/Machine id/(Logical volume, Photo)
12. Facebook – Serving the Photo - Haystack
13. Facebook – Scribe - Logging
Nodes Nodes Nodes
Scribe Scribe Scribe
Central Scribe Server
Dashboards
HBase
$messages = array();
$entry = new LogEntry;
$entry->category = "buckettest";
$entry->message = "something very";
$messages []= $entry;
$result = $conn->Log($messages);
14. facebook – Services – Thrift
• Lightweight software framework for cross-language development
• Dev need not worry about serialization, connection handling and threading
• Supported bindings:
– C++, PHP, Python, Java, Ruby, Erlang, Perl, Haskell
• Transports: Simple interface to I/O
• Protocols: Serialization format
– TBinaryProtocol, TJSONProtocol
• Servers
– Non Blocking, Async, Single threaded, multi-threaded
15. facebook – Memcache
• In-memory distributed hash table
• “hot” data from MySQL stored in cache
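The deck mentions a write-through cache implementation in front of MySQL. In a write-through scheme every write updates the backing store and the cache in the same operation, so reads of hot keys are served from memory without returning stale values. The Python sketch below uses plain dictionaries as stand-ins for Memcache and MySQL; the class and method names are illustrative and are not the Memcache client API.

```python
class WriteThroughCache:
    """Write-through cache over a backing store (dicts stand in for both tiers)."""

    def __init__(self, database):
        self._db = database  # stand-in for MySQL
        self._cache = {}     # stand-in for Memcache
        self.db_reads = 0    # counts how often a read had to hit the database

    def write(self, key, value):
        # Write-through: update the store and the cache together.
        self._db[key] = value
        self._cache[key] = value

    def read(self, key):
        if key in self._cache:
            return self._cache[key]
        # Cache miss: fall back to the database and keep the value hot.
        self.db_reads += 1
        value = self._db[key]
        self._cache[key] = value
        return value


cache = WriteThroughCache(database={"user:1": "XYZ"})
cache.write("user:2", "ABC")
print(cache.read("user:2"), cache.read("user:1"), cache.db_reads)
```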
16. facebook – front end - PHP
• Op-Code Optimization
• APC improvements (Alternative PHP Cache)
– Lazy Loading
– Cache priming
• Custom Extensions
– Memcache Client Extension
– Serialization format
– Logging, Stats Collection, Monitoring
– Asynchronous event-handling mechanism
17. facebook – front end – Hip Hop
• Source Code Transformer
• Static Analysis, type inference, Code Generation
• Easier to write extensions
• Significantly cuts down on CPU and Memory usage
18. facebook – front end – Hip Hop
19. facebook – front end – BigPipe
BigPipe first breaks web pages into multiple chunks called pagelets
20. facebook – front end – BigPipe
BigPipe first breaks web pages into multiple chunks called pagelets
Request Parsing: Web server parses and sanity checks the request
Data Fetching: Web server fetches data from storage tier
Markup Generation: Web server generates HTML markup
Network Transport: Response is transferred
CSS downloading
DOM Tree Construction
JavaScript downloading
JS Execution
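The pagelet idea can be sketched as a response generator: the server flushes a page skeleton with empty placeholders first, then streams one chunk per pagelet as its markup becomes available, so the browser can start downloading CSS and JavaScript while the server is still fetching data. This Python sketch only illustrates the ordering of chunks; the `render_page` function and the client-side `onPageletArrive` callback name are assumptions, and a production version would also need to escape the payload safely for embedding in a script tag.

```python
import json


def render_page(pagelets):
    """Yield response chunks: skeleton first, then one script chunk per pagelet.

    pagelets maps a placeholder id to a zero-argument function returning HTML.
    """
    placeholders = "".join(f'<div id="{name}"></div>' for name in pagelets)
    yield f"<html><body>{placeholders}"
    for name, build in pagelets.items():
        # Each pagelet is generated and flushed independently of the others.
        payload = json.dumps({"id": name, "html": build()})
        yield f"<script>onPageletArrive({payload})</script>"
    yield "</body></html>"


page = {
    "navigation": lambda: "<ul><li>Home</li></ul>",
    "newsfeed": lambda: "<p>Latest stories</p>",
}
for chunk in render_page(page):
    print(chunk)
```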
21. facebook – Technology Stack
Front End Big Pipe Hip Hop
PHP - Custom compiler / Cache implementations
Linux – Custom Kernel Extensions
Service Aggregators
Scribe
Thrift
Service 1 Service 2 Service 3 Service 4
Data Store
MemCache – Write Through Cache implementation
Cassandra MySQL HBase HayStack
22. facebook – Messages Infrastructure
25. facebook – Cells
Cell
Node 1, Node 2, Node 3, Node 4, Node n
ZooKeeper Controller Machines
Application Server Cluster
Metadata Store
26. facebook – Cells
• They help scale incrementally while limiting failure scenarios
• Easy upgrades
• Metadata store failures affect only a few users
• Easy rollout
• Flexibility to host cells in different data centers with multi-homing for disaster recovery
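The properties listed above follow from pinning each user to one cell: capacity grows by adding cells, and a failure inside a cell touches only the users assigned to it. The Python sketch below illustrates that with a simple directory; the `CellDirectory` class, the least-loaded placement rule and the cell names are assumptions for illustration, not details from the talk.

```python
class CellDirectory:
    """Maps each user to exactly one cell (illustrative sketch)."""

    def __init__(self, cell_names):
        self._load = {name: 0 for name in cell_names}  # users per cell
        self._assignment = {}                          # user_id -> cell name

    def cell_for(self, user_id):
        """Return the user's cell, placing new users in the least-loaded cell."""
        if user_id not in self._assignment:
            cell = min(self._load, key=self._load.get)
            self._assignment[user_id] = cell
            self._load[cell] += 1
        return self._assignment[user_id]

    def add_cell(self, name):
        """Scale incrementally: existing users stay where they are."""
        self._load[name] = 0

    def users_affected_by(self, failed_cell):
        """Only users pinned to the failed cell are affected."""
        return {u for u, c in self._assignment.items() if c == failed_cell}


directory = CellDirectory(["cell-a", "cell-b"])
for user_id in range(1, 5):
    directory.cell_for(user_id)
print(directory.users_affected_by("cell-a"))
```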
27. Take away – for our applications
• Really parallel Asynchronous AJAX Pages
– ASP.Net Update panels are a HOAX
• Appropriate usage of client side technology
• Cache – Cache – Cache
– Write Through Caches are way better
– App Fabric cache/ Memcache
• High normalization is not needed
– Store denormalized views – materialized views
• Parallel Services and Service aggregators
• Fault tolerant applications
• Asynchronous Processing
• 1 Sec response time is too SLOW
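The "parallel services and service aggregators" and "fault tolerant" takeaways can be sketched together: an aggregator calls its backing services concurrently, gives each a time budget, and returns a partial result when one of them is slow or fails. This Python sketch uses `asyncio`; the service names, delays and the `call_service` stub are hypothetical stand-ins for real network calls.

```python
import asyncio


async def call_service(name, delay):
    """Stand-in for a remote call; sleeps instead of doing network I/O."""
    await asyncio.sleep(delay)
    return f"{name}: ok"


async def aggregate(services, timeout=1.0):
    """Call all services concurrently; a slow or failing one yields None."""
    calls = [
        asyncio.wait_for(call_service(name, delay), timeout)
        for name, delay in services.items()
    ]
    # return_exceptions=True keeps one failure from cancelling the rest.
    results = await asyncio.gather(*calls, return_exceptions=True)
    return {
        name: (None if isinstance(result, Exception) else result)
        for name, result in zip(services, results)
    }


if __name__ == "__main__":
    page_data = asyncio.run(
        aggregate({"profile": 0.01, "feed": 0.02, "ads": 0.5}, timeout=0.1)
    )
    print(page_data)
```

With this shape, total latency is bounded by the slowest call or the timeout rather than the sum of all calls, and the page can still render when one service misses its budget.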