This document provides an overview of Spark and using Spark on HDInsight. It discusses Spark concepts like RDDs, transformations, and actions. It also covers Spark extensions like Spark SQL, Spark Streaming, and MLlib. Finally, it highlights benefits of using Spark on HDInsight like integration with Azure services, scalability, and support.
Structured Streaming for Columnar Data Warehouses with Jack GudenkaufDatabricks
Organizations building big data analytics solutions for streaming environments struggle with adapting legacy batch systems for streaming, supporting multiple columnar analytical databases, providing time series aggregations, and streaming Fact and Dimensional data into star schemas. In this session, you will learn how we overcame these challenges and developed an end-user self-service, no-code required “ETL” framework. Extensible and operationally robust, this developer framework includes a Spark Structured Streaming app for Kafka, Hadoop/Hive (ORC, Parquet), OpenTSDB/HBase, and Vertica data pipelines.
Video: https://www.youtube.com/watch?v=kkOG_aJ9KjQ
This talk gives details about Spark internals and an explanation of the runtime behavior of a Spark application. It explains how high level user programs are compiled into physical execution plans in Spark. It then reviews common performance bottlenecks encountered by Spark users, along with tips for diagnosing performance problems in a production application.
Building a Versatile Analytics Pipeline on Top of Apache Spark with Mikhail C...Databricks
It is common for consumer Internet companies to start off with popular third-party tools for analytics needs. Then, when the user base and the company grows, they end up building their own analytics data pipeline and query engine to cope with their data scale, satisfy custom data enrichment and reporting needs and achieve high quality of their data. That’s exactly the path that was taken at Grammarly, the popular online proofreading service.
In this session, Grammarly will share how they improved business and marketing analytics, previously done with Mixpanel, by building their own in-house analytics engine and application on top of Apache Spark. Chernetsov wil touch upon several Spark tweaks and gotchas that they experienced along the way:
– Outputting data to several storages in a single Spark job
– Dealing with Spark memory model, building a custom spillable data-structure for your data traversal
– Implementing a custom query language with parser combinators on top of Spark sql parser
– Custom query optimizer and analyzer when you want not exactly sql
– Flexible-schema storage and query against multi-schema data with schema conflicts
– Custom aggregation functions in Spark SQL
Video to talk: https://www.youtube.com/watch?v=gd4Jqtyo7mM
Apache Spark is a next generation engine for large scale data processing built with Scala. This talk will first show how Spark takes advantage of Scala's function idioms to produce an expressive and intuitive API. You will learn about the design of Spark RDDs and the abstraction enables the Spark execution engine to be extended to support a wide variety of use cases(Spark SQL, Spark Streaming, MLib and GraphX). The Spark source will be be referenced to illustrate how these concepts are implemented with Scala.
http://www.meetup.com/Scala-Bay/events/209740892/
Structured Streaming for Columnar Data Warehouses with Jack GudenkaufDatabricks
Organizations building big data analytics solutions for streaming environments struggle with adapting legacy batch systems for streaming, supporting multiple columnar analytical databases, providing time series aggregations, and streaming Fact and Dimensional data into star schemas. In this session, you will learn how we overcame these challenges and developed an end-user self-service, no-code required “ETL” framework. Extensible and operationally robust, this developer framework includes a Spark Structured Streaming app for Kafka, Hadoop/Hive (ORC, Parquet), OpenTSDB/HBase, and Vertica data pipelines.
Video: https://www.youtube.com/watch?v=kkOG_aJ9KjQ
This talk gives details about Spark internals and an explanation of the runtime behavior of a Spark application. It explains how high level user programs are compiled into physical execution plans in Spark. It then reviews common performance bottlenecks encountered by Spark users, along with tips for diagnosing performance problems in a production application.
Building a Versatile Analytics Pipeline on Top of Apache Spark with Mikhail C...Databricks
It is common for consumer Internet companies to start off with popular third-party tools for analytics needs. Then, when the user base and the company grows, they end up building their own analytics data pipeline and query engine to cope with their data scale, satisfy custom data enrichment and reporting needs and achieve high quality of their data. That’s exactly the path that was taken at Grammarly, the popular online proofreading service.
In this session, Grammarly will share how they improved business and marketing analytics, previously done with Mixpanel, by building their own in-house analytics engine and application on top of Apache Spark. Chernetsov wil touch upon several Spark tweaks and gotchas that they experienced along the way:
– Outputting data to several storages in a single Spark job
– Dealing with Spark memory model, building a custom spillable data-structure for your data traversal
– Implementing a custom query language with parser combinators on top of Spark sql parser
– Custom query optimizer and analyzer when you want not exactly sql
– Flexible-schema storage and query against multi-schema data with schema conflicts
– Custom aggregation functions in Spark SQL
Video to talk: https://www.youtube.com/watch?v=gd4Jqtyo7mM
Apache Spark is a next generation engine for large scale data processing built with Scala. This talk will first show how Spark takes advantage of Scala's function idioms to produce an expressive and intuitive API. You will learn about the design of Spark RDDs and the abstraction enables the Spark execution engine to be extended to support a wide variety of use cases(Spark SQL, Spark Streaming, MLib and GraphX). The Spark source will be be referenced to illustrate how these concepts are implemented with Scala.
http://www.meetup.com/Scala-Bay/events/209740892/
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...DataStax
At Knewton we operate across five different VPCs a total of 29 clusters, each ranging from 3 nodes to 24 nodes. For a team of three to maintain this is not herculean, however good tools to diagnose issues and gather information in a distributed manner are vital to moving quickly and minimizing engineering time spent.
The database team at Knewton has been successfully using a combination of Ansible and custom open sourced tools to maintain and improve the Cassandra deployment at Knewton. I will be talking about several of these tools and giving examples of how we are using them. Specifically I will discuss the cassandra-tracing tool, which analyzes the contents of the system_traces keyspace, and the cassandra-stat tool, which gives real-time output of the operations of a cassandra cluster. Distributed administration with ad-hoc Ansible will also be covered and I will walk through examples of using these commands to identify and remediate clusterwide issues.
About the Speaker
Jeffrey Berger Lead Database Engineer, Knewton
Dr. Jeffrey Berger is currently the lead database engineer at Knewton, an education tech startup in NYC. He joined the tech scene in NYC in 2013 and spent two years working with MongoDB, becoming a certified MongoDB administrator and a MongoDB Master. He received his Cassandra Administrator certification at Cassandra Summit 2015. He holds a Ph.D. in Theoretical Physics from Penn State and spent several years working on high energy nuclear interactions.
Contemporary computing hardware offers massive new performance opportunities. Yet high-performance programming remains a daunting challenge.
We present some of the lessons learned while designing faster indexes, with a particular emphasis on compressed bitmap indexes. Compressed bitmap indexes accelerate queries in popular systems such as Apache Spark, Git, Elastic, Druid and Apache Kylin.
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...Data Con LA
After a brief technical introduction to Apache Cassandra we'll then go into the exciting world of Apache Spark integration, and learn how you can turn your transactional datastore into an analytics platform. Apache Spark has taken the Hadoop world by storm (no pun intended!), and is widely seen as the replacement to Hadoop Map Reduce. Apache Spark coupled with Cassandra are perfect allies, Cassandra does the distributed data storage, Spark does the distributed computation.
Spark Streaming makes it easy to build scalable fault-tolerant streaming applications. In this webinar, developers will learn:
*How Spark Streaming works - a quick review.
*Features in Spark Streaming that help prevent potential data loss.
*Complementary tools in a streaming pipeline - Kafka and Akka.
*Design and tuning tips for Reactive Spark Streaming applications.
Using Spark to Load Oracle Data into CassandraJim Hatcher
This presentation describes how you can use Spark as an ETL tool to get data from a relational database into Cassandra. I go through the concept in general and then talk about some specific issues you might run into and how to fix them.
This talk gives details about Spark internals and an explanation of the runtime behavior of a Spark application. It explains how high level user programs are compiled into physical execution plans in Spark. It then reviews common performance bottlenecks encountered by Spark users, along with tips for diagnosing performance problems in a production application.
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...Databricks
Of all the developers’ delight, none is more attractive than a set of APIs that make developers productive, that are easy to use, and that are intuitive and expressive. Apache Spark offers these APIs across components such as Spark SQL, Streaming, Machine Learning, and Graph Processing to operate on large data sets in languages such as Scala, Java, Python, and R for doing distributed big data processing at scale. In this talk, I will explore the evolution of three sets of APIs-RDDs, DataFrames, and Datasets-available in Apache Spark 2.x. In particular, I will emphasize three takeaways: 1) why and when you should use each set as best practices 2) outline its performance and optimization benefits; and 3) underscore scenarios when to use DataFrames and Datasets instead of RDDs for your big data distributed processing. Through simple notebook demonstrations with API code examples, you’ll learn how to process big data using RDDs, DataFrames, and Datasets and interoperate among them. (this will be vocalization of the blog, along with the latest developments in Apache Spark 2.x Dataframe/Datasets and Spark SQL APIs: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html)
Enabling Fast Data Strategy: What’s new in Denodo Platform 6.0Denodo
In this presentation, you will see the new functionalities of the Denodo 6.0 detailing dynamic query optimization engine, managing enterprise deployments, and using information self-service for discovery and search.
This presentation is part of the Fast Data Strategy Conference, and you can watch the video here goo.gl/DzRtkg.
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...DataStax
At Knewton we operate across five different VPCs a total of 29 clusters, each ranging from 3 nodes to 24 nodes. For a team of three to maintain this is not herculean, however good tools to diagnose issues and gather information in a distributed manner are vital to moving quickly and minimizing engineering time spent.
The database team at Knewton has been successfully using a combination of Ansible and custom open sourced tools to maintain and improve the Cassandra deployment at Knewton. I will be talking about several of these tools and giving examples of how we are using them. Specifically I will discuss the cassandra-tracing tool, which analyzes the contents of the system_traces keyspace, and the cassandra-stat tool, which gives real-time output of the operations of a cassandra cluster. Distributed administration with ad-hoc Ansible will also be covered and I will walk through examples of using these commands to identify and remediate clusterwide issues.
About the Speaker
Jeffrey Berger Lead Database Engineer, Knewton
Dr. Jeffrey Berger is currently the lead database engineer at Knewton, an education tech startup in NYC. He joined the tech scene in NYC in 2013 and spent two years working with MongoDB, becoming a certified MongoDB administrator and a MongoDB Master. He received his Cassandra Administrator certification at Cassandra Summit 2015. He holds a Ph.D. in Theoretical Physics from Penn State and spent several years working on high energy nuclear interactions.
Contemporary computing hardware offers massive new performance opportunities. Yet high-performance programming remains a daunting challenge.
We present some of the lessons learned while designing faster indexes, with a particular emphasis on compressed bitmap indexes. Compressed bitmap indexes accelerate queries in popular systems such as Apache Spark, Git, Elastic, Druid and Apache Kylin.
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...Data Con LA
After a brief technical introduction to Apache Cassandra we'll then go into the exciting world of Apache Spark integration, and learn how you can turn your transactional datastore into an analytics platform. Apache Spark has taken the Hadoop world by storm (no pun intended!), and is widely seen as the replacement to Hadoop Map Reduce. Apache Spark coupled with Cassandra are perfect allies, Cassandra does the distributed data storage, Spark does the distributed computation.
Spark Streaming makes it easy to build scalable fault-tolerant streaming applications. In this webinar, developers will learn:
*How Spark Streaming works - a quick review.
*Features in Spark Streaming that help prevent potential data loss.
*Complementary tools in a streaming pipeline - Kafka and Akka.
*Design and tuning tips for Reactive Spark Streaming applications.
Using Spark to Load Oracle Data into CassandraJim Hatcher
This presentation describes how you can use Spark as an ETL tool to get data from a relational database into Cassandra. I go through the concept in general and then talk about some specific issues you might run into and how to fix them.
This talk gives details about Spark internals and an explanation of the runtime behavior of a Spark application. It explains how high level user programs are compiled into physical execution plans in Spark. It then reviews common performance bottlenecks encountered by Spark users, along with tips for diagnosing performance problems in a production application.
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...Databricks
Of all the developers’ delight, none is more attractive than a set of APIs that make developers productive, that are easy to use, and that are intuitive and expressive. Apache Spark offers these APIs across components such as Spark SQL, Streaming, Machine Learning, and Graph Processing to operate on large data sets in languages such as Scala, Java, Python, and R for doing distributed big data processing at scale. In this talk, I will explore the evolution of three sets of APIs-RDDs, DataFrames, and Datasets-available in Apache Spark 2.x. In particular, I will emphasize three takeaways: 1) why and when you should use each set as best practices 2) outline its performance and optimization benefits; and 3) underscore scenarios when to use DataFrames and Datasets instead of RDDs for your big data distributed processing. Through simple notebook demonstrations with API code examples, you’ll learn how to process big data using RDDs, DataFrames, and Datasets and interoperate among them. (this will be vocalization of the blog, along with the latest developments in Apache Spark 2.x Dataframe/Datasets and Spark SQL APIs: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html)
Enabling Fast Data Strategy: What’s new in Denodo Platform 6.0Denodo
In this presentation, you will see the new functionalities of the Denodo 6.0 detailing dynamic query optimization engine, managing enterprise deployments, and using information self-service for discovery and search.
This presentation is part of the Fast Data Strategy Conference, and you can watch the video here goo.gl/DzRtkg.
SAMOA: A Platform for Mining Big Data Streams (Apache BigData North America 2...Nicolas Kourtellis
A general overview of the APACHE SAMOA platform for mining big data streams using machine learning algorithms running on distributed stream processing platforms such as Apache STORM, Apache Flink, Apache Samza and Apache Apex.
Results are shown from experimentation with VHT, the Vertical Hoeffding Tree proposed in "VHT: Vertical Hoeffding Tree." N. Kourtellis, G. De Francisci Morales, A. Bifet, A. Mordupo. IEEE BigData 2016.
Presentation in APACHE BIG DATA North America 2016
SAMOA: A Platform for Mining Big Data Streams (Apache BigData Europe 2015)Nicolas Kourtellis
A general overview of the APACHE SAMOA platform for mining big data streams using machine learning algorithms running on distributed stream processing platforms such as Apache STORM, Apache Flink, Apache Samza and Apache Apex.
Results are shown from experimentation with VHT, the Vertical Hoeffding Tree proposed in "VHT: Vertical Hoeffding Tree." N. Kourtellis, G. De Francisci Morales, A. Bifet, A. Mordupo. IEEE BigData 2016.
Presentation in APACHE BIG DATA Europe 2015
Презентация Виталия Никитина о возомжностях платформы HPE Idol для работы с BigData в современном кол-центре. Аналитика аудио и текстовой информации на базе платформы HPE IDOL
ANTS - 360 view of your customer - bigdata innovation summit 2016
Dữ liệu lớn, theo nghĩa đen, không phải là một cái gì đó mới, nó đã hiện hữu trong một thời gian dài. Bất kỳ một công ty đã kinh doanh trong một vài năm đều sở hữu một khối lượng dữ liệu từ thông tin khách hàng thông qua hồ sơ giao dịch và sử dụng sản phẩm. Những doanh nghiệp khác nhau sẽ có những khả năng khai thác và sử dụng dữ liệu lớn khác nhau.
Một số doanh nghiệp đã đạt đến độ chín muồi, trong khi đó một số doanh nghiệp khác chỉ vừa bắt đầu cuộc hành trình. Tại Việt Nam các doanh nghiệp viễn thông, các hãng hàng không, các ngân hàng, các tập đoàn bán lẻ, các cơ quan chính phủ đang sở hữu một khối lượng dữ liệu khổng lồ, tuy nhiên, việc xử lý, phân tích dữ liệu lớn còn đang trong giai đoạn rất sơ khai.
Diễn đàn dữ liệu lớn sẽ tập trung chia sẻ những kiến thức thực tế và mang tính chất ứng dụng để người tham dự có thể tích luỹ và áp dụng.
ITpro Active主催「ビッグデータはクラウドで操るBigData Platform Conference~IoT時代を勝ち抜くためのDataBase as a Service活用法~」<11月11日(水)開催>資料
実際のお客様事例をベースにインフラエンジニア、業務担当者、データサイエンティスト、マーケティングのペルソナをベースにそれぞれのシナリオでのクラウド環境を活用したデータ分析について講演
Oxalide MorningTech #1 - BigData
1er MorningTech @Oxalide, animé par Ludovic Piot (@lpiot), le 15 décembre 2016.
Pour cette 1ère édition du Morning Tech nous vous proposons une overview sur un des thèmes du moment : le Big Data.
Au delà de ce buzz word nous aborderons :
Les grands concepts
Les étapes clés des projets Big Data et les technologies à utiliser (stockage, ingestion, …)
Les enjeux des architectures Big Data (architecture lambda, …)
L'intelligence artificielle (machine learning, deep learning, …)
Et nous finirons par un cas d'usage du big data sur AWS autour de l'utilisation des données gyroscopiques de vos internautes mobiles
Subject: Oxalide's 1st MorningTech talk about BigData.
Date: 15-dec-2016
Speakers: Ludovic Piot (@lpiot, @oxalide)
Language: french
Lien SpeakerDeck : https://speakerdeck.com/lpiot/oxalide-morningtech-number-1-bigdata
Lien SlideShare : https://www.slideshare.net/LudovicPiot/oxalide-morningtech-1-bigdata
YouTube Video capture: https://youtu.be/7O85lRzvMY0
Main topics:
* Les grands enjeux du BigData
** les 3 V du Gartner : volume, variété, vélocité
* Le stockage des données
** datalake
** les technos
* L'ingestion des données
** ETL
** datastream
** les technos
* Les enjeux du compute
** map-reduce
** spark
** lambda architecture
* Démo d'une plateforme BigData sur AWS
* L'intelligence artificielle
** datascience exploratoire et notebooks,
** machine learning,
** deep learning,
** data pipeline
** les technos
* Pour aller plus loin
** La gouvernance des données
** La dataviz
BigData HUB is a non-profit organization that help to spread Big Data and Data Science technology around Egyptian universities and Globally.
https://www.facebook.com/BigDataHub
위 자료는 BOAZ 2016 하반기 프로젝트 주제의 하나로, Advanced 정규세션 동안 Base 정규세션에서 배웠던 다양한 이론들과 기본 지식들, 그리고 툴 활용능력들을 직접 실행하며 진행한 결과물입니다.
*** 서울시 2030 나홀로족을 위한 라이프 가이드북 ***
서울에 거주하는 2030 나홀로족을 위해 제작된 라이프 가이드북. 이 가이드북의 주목적은 먹는 것(식) 그리고 사는 것(주)에 대해서 그에 관한 정보를 주는 것임.
6기 김승효 중앙대학교 응용통계학과
6기 김재은 이화여자대학교 시각디자인과
7기 박다혜 한국외국어대학교 통계학과
** 국내 최초 대학생 빅데이터 연합동아리 BOAZ **
Blog : http://BOAZbigdata.com
Facebook : http://fb.com/BOAZbigdata
Java is a high-level programming language developed by Sun Microsystems but later was taken over by Oracle. This tutorial gives a basic understanding on Polymorphism.
Jump Start into Apache® Spark™ and DatabricksDatabricks
These are the slides from the Jump Start into Apache Spark and Databricks webinar on February 10th, 2016.
---
Spark is a fast, easy to use, and unified engine that allows you to solve many Data Sciences and Big Data (and many not-so-Big Data) scenarios easily. Spark comes packaged with higher-level libraries, including support for SQL queries, streaming data, machine learning, and graph processing. We will leverage Databricks to quickly and easily demonstrate, visualize, and debug our code samples; the notebooks will be available for you to download.
Author: Stefan Papp, Data Architect at “The unbelievable Machine Company“. An overview of Big Data Processing engines with a focus on Apache Spark and Apache Flink, given at a Vienna Data Science Group meeting on 26 January 2017. Following questions are addressed:
• What are big data processing paradigms and how do Spark 1.x/Spark 2.x and Apache Flink solve them?
• When to use batch and when stream processing?
• What is a Lambda-Architecture and a Kappa Architecture?
• What are the best practices for your project?
Jump Start with Apache Spark 2.0 on DatabricksDatabricks
Apache Spark 2.0 has laid the foundation for many new features and functionality. Its main three themes—easier, faster, and smarter—are pervasive in its unified and simplified high-level APIs for Structured data.
In this introductory part lecture and part hands-on workshop you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas:
What’s new in Spark 2.0
SparkSessions vs SparkContexts
Datasets/Dataframes and Spark SQL
Introduction to Structured Streaming concepts and APIs
Founding committer of Spark, Patrick Wendell, gave this talk at 2015 Strata London about Apache Spark.
These slides provides an introduction to Spark, and delves into future developments, including DataFrames, Datasource API, Catalyst logical optimizer, and Project Tungsten.
Introducing Apache Spark's Data Frames and Dataset APIs workshop seriesHolden Karau
This session of the workshop introduces Spark SQL along with DataFrames, Datasets. Datasets give us the ability to easily intermix relational and functional style programming. So that we can explore the new Dataset API this iteration will be focused in Scala.
Machine Learning with H2O, Spark, and Python at Strata 2015Sri Ambati
Machine Learning with H2O, Spark, and Python at Strata SJ 2015-by Cliff Click and Michal Malohlava
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Databricks
“As Apache Spark becomes more widely adopted, we have focused on creating higher-level APIs that provide increased opportunities for automatic optimization. In this talk, I give an overview of some of the exciting new API’s available in Spark 2.0, namely Datasets and Structured Streaming. Together, these APIs are bringing the power of Catalyst, Spark SQL's query optimizer, to all users of Spark. I'll focus on specific examples of how developers can build their analyses more quickly and efficiently simply by providing Spark with more information about what they are trying to accomplish.” - Michael
Databricks Blog: "Deep Dive into Spark SQL’s Catalyst Optimizer"
https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
// About the Presenter //
Michael Armbrust is the lead developer of the Spark SQL project at Databricks. He received his PhD from UC Berkeley in 2013, and was advised by Michael Franklin, David Patterson, and Armando Fox. His thesis focused on building systems that allow developers to rapidly build scalable interactive applications, and specifically defined the notion of scale independence. His interests broadly include distributed systems, large-scale structured storage and query optimization.
Follow Michael on -
Twitter: https://twitter.com/michaelarmbrust
LinkedIn: https://www.linkedin.com/in/michaelarmbrust
Similar to SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big Data analytics with Microsoft Azure" (20)
SE2016 Company Development Valentin Dombrovsky "Travel startups challenges an...Inhacking
Event: #SE2016
Stage: Company Development
Data: 3 of September 2016
Speaker: Valentin Dombrovsky
Topic: Travel startups challenges and opportunities
INHACKING site: https://inhacking.com
SE2016 site: http://se2016.inhacking.com/
SE2016 Company Development Vadym Gorenko "How to pass the death valley"Inhacking
Event: #SE2016
Stage: Company Development
Data: 3 of September 2016
Speaker: Jan Keil
Topic: Do the right thing marketing for startups
INHACKING site: https://inhacking.com
SE2016 site: http://se2016.inhacking.com/
SE2016 Marketing&PR Jan Keil "Do the right thing marketing for startups"Inhacking
Event: #SE2016
Stage: Marketing&PR
Data: 3 of September 2016
Speaker: Jan Keil
Topic: Do the right thing marketing for startups
INHACKING site: https://inhacking.com
SE2016 site: http://se2016.inhacking.com/
SE2016 PR&Marketing Mikhail Patalakha "ASO how to start and how to finish"Inhacking
Event: #SE2016
Stage: PR&Marketing
Data: 2 of September 2016
Speaker: Mikhail Patalaha
Topic: ASO how to start and how to finish
INHACKING site: https://inhacking.com
SE2016 site: http://se2016.inhacking.com/
SE2016 UI/UX Alina Kononenko "Designing for Apple Watch and Apple TV"Inhacking
Event: #SE2016
Stage: UI/UX
Data: 2 of September 2016
Speaker: Alina Kononenko
Topic: Designing for Apple Watch and Apple TV
INHACKING site: https://inhacking.com
SE2016 site: http://se2016.inhacking.com/
SE2016 Management Mikhail Lebedinkiy "iAIST the first pure ukrainian corporat...Inhacking
Event: #SE2016
Stage: Management & Trends
Data: 2 of September 2016
Speaker: Mikhail Lebedinkiy
Topic: iAIST the first pure Ukrainian corporate ERP and BI cloud service
INHACKING site: https://inhacking.com
SE2016 site: http://se2016.inhacking.com/
SE2016 Management Anna Lavrova "Gladiator in the suit crisis is our brand!"Inhacking
Event: #SE2016
Stage: Management & Trends
Data: 2 of September 2016
Speaker: Anna Lavrova
Topic: Gladiator in the suit crisis is our brand!
INHACKING site: https://inhacking.com
SE2016 site: http://se2016.inhacking.com/
SE2016 Management Aleksey Solntsev "Management of the projects in the conditi...Inhacking
Event: #SE2016
Stage: Management & Trends
Data: 2 of September 2016
Speaker: Aleksey Solntsev
Topic: Management of the projects in the conditions of uncertainty
INHACKING site: https://inhacking.com
SE2016 site: http://se2016.inhacking.com/
SE2016 Management Vitalii Laptenok "Processes and planning for a product comp...Inhacking
Event: #SE2016
Stage: Management & Trends
Data: 2 of September 2016
Speaker: Vitalii Laptenok
Topic: Processes and planning for a product company
INHACKING site: https://inhacking.com
SE2016 site: http://se2016.inhacking.com/
SE2016 iOS Anton Fedorchenko "Swift for Server-side Development"Inhacking
Event: #SE2016
Stage: iOS
Data: 4 of September 2016
Speaker: Anton Fedorchenko
Topic: Swift for Server-side Development
INHACKING site: https://inhacking.com
SE2016 site: http://se2016.inhacking.com/
SE2016 iOS Alexander Voronov "Test driven development in real world"Inhacking
Event: #SE2016
Stage: iOS
Data: 4 of September 2016
Speaker: Alexander Voronov
Topic: Test Driven Development in real world
INHACKING site: https://inhacking.com
SE2016 site: http://se2016.inhacking.com/
SE2016 JS Gregory Shehet "Undefined on prod, or how to test a react application"Inhacking
Event: #SE2016
Stage: JS
Data: 4 of September 2016
Speaker: Gregory Shehet
Topic: Undefined on prod, or how to test a react application
INHACKING site: https://inhacking.com
SE2016 site: http://se2016.inhacking.com/
SE2016 Java Vladimir Mikhel "Scrapping the web"Inhacking
Event: #SE2016
Stage: Java
Data: 3 of September 2016
Speaker: Vladimir Mikhel
Topic: Scrapping the web
INHACKING site: https://inhacking.com
SE2016 site: http://se2016.inhacking.com/
Techniques to optimize the pagerank algorithm usually fall in two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time, and the number of iterations. If a graph has no dangling nodes, pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time, no. of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Adjusting primitives for graph : SHORT REPORT / NOTESSubhajit Sahu
Graph algorithms, like PageRank Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...pchutichetpong
M Capital Group (“MCG”) expects to see demand and the changing evolution of supply, facilitated through institutional investment rotation out of offices and into work from home (“WFH”), while the ever-expanding need for data storage as global internet usage expands, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
As Europe's leading economic powerhouse and the fourth-largest hashtag#economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like hashtag#Russia and hashtag#China, hashtag#Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in hashtag#cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to hashtag#AdvancedPersistentThreats (hashtag#APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big Data analytics with Microsoft Azure"
1. Vitalii Bondarenko
Data Platform Competency Manager at Eleks
Vitaliy.bondarenko@eleks.com
HDInsight: Spark
Advanced in-memory BigData Analytics with Microsoft Azure
3. About me
Vitalii Bondarenko
Data Platform Competency Manager
Eleks
www.eleks.com
20 years in software development
9+ years of developing for MS SQL Server
3+ years of architecting Big Data Solutions
● DW/BI Architect and Technical Lead
● OLTP DB Performance Tuning
●
Big Data Data Platform Architect
5. Spark Stack
● Clustered computing platform
● Designed to be fast and general purpose
● Integrated with distributed systems
● API for Python, Scala, Java, clear and understandable code
● Integrated with Big Data and BI Tools
● Integrated with different Data Bases, systems and libraries like Cassanda, Kafka, H2O
● First Apache release 2013, this moth v.2.0 has been released
8. Execution Model
Spark Execution
● Shells and Standalone application
● Local and Cluster (Standalone, Yarn, Mesos, Cloud)
Spark Cluster Arhitecture
● Master / Cluster manager
● Cluster allocates resources on nodes
● Master sends app code and tasks tor nodes
● Executers run tasks and cache data
Connect to Cluster
● Local
● SparkContext and Master field
● spark://host:7077
● Spark-submit
9. DEMO: Execution Environments
● Local Spark installation
● Shells and Notebook
● Spark Examples
● HDInsight Spark Cluster
● SSH connection to Spark in Azure
● Jupyter Notebook connected to HDInsight Spark
21. Spark Streaming Architecture
● Micro-batch architecture
● SparkStreaming Concext
● Batch interval from 500ms
●
Transformation on Spark Engine
●
Outup operations instead of Actions
● Different sources and outputs
22. Spark Streaming Example
from pyspark.streaming import StreamingContext
ssc = StreamingContext(sc, 1)
input_stream = ssc.textFileStream("sampleTextDir")
word_pairs = input_stream.flatMap(
lambda l:l.split(" ")).map(lambda w: (w,1))
counts = word_pairs.reduceByKey(lambda x,y: x + y)
counts.print()
ssc.start()
ssc.awaitTermination()
● Process RDDs in batches
● Start after ssc.start()
● Output to console on Driver
● Awaiting termination
23. Streaming on a Cluster
● Receivers with replication
● SparkContext on Driver
● Output from Exectors in batches saveAsHadoopFiles()
● spark-submit for creating and scheduling periodical streaming jobs
● Chekpointing for saving results and restore from the point ssc.checkpoint(“hdfs://...”)
24. Streaming Transformations
● DStreams
●
Stateless transformantions
● Stagefull transformantions
● Windowed transformantions
● UpdateStateByKey
● ReduceByWindow, reduceByKeyAndWindow
● Recomended batch size from 10 sec
val ipDStream = accessLogsDStream.map(logEntry => (logEntry.getIpAddress(), 1))
val ipCountDStream = ipDStream.reduceByKeyAndWindow(
{(x, y) => x + y}, // Adding elements in the new batches entering the window
{(x, y) => x - y}, // Removing elements from the oldest batches exiting the window
Seconds(30), // Window duration
Seconds(10)) // Slide duration
26. Spark SQL
●
SparkSQL interface for working with structured data by SQL
●
Works with Hive tables and HiveQL
● Works with files (Json, Parquet etc) with defined schema
●
JDBC/ODBC connectors for BI tools
●
Integrated with Hive and Hive types, uses HiveUDF
●
DataFrame abstraction
27. Spark DataFrames
● hiveCtx.cacheTable("tableName"), in-memory, column-store, while driver is alive
● df.show()
● df.select(“name”, df(“age”)+1)
● df.filtr(df(“age”) > 19)
● df.groupBy(df(“name”)).min()
# Import Spark SQLfrom pyspark.sql
import HiveContext, Row
# Or if you can't include the hive requirementsfrom pyspark.sql
import SQLContext, Row
sc = new SparkContext(...)
hiveCtx = HiveContext(sc)
sqlContext = SQLContext(sc)
input = hiveCtx.jsonFile(inputFile)
# Register the input schema RDD
input.registerTempTable("tweets")
# Select tweets based on the retweet
CounttopTweets = hiveCtx.sql("""SELECT text, retweetCount FROM tweets ORDER BY retweetCount LIMIT 10""")
28. Catalyst: Query Optimizer
● Analysis: map tables, columns, function, create a logical plan
● Logical Optimization: applies rules and optimize the plan
● Physical Planing: physical operator for the logical plan execution
● Cost estimation
SELECT name
FROM (
SELECT id, name
FROM People) p
WHERE p.id = 1
29. DEMO: Using SparkSQL
● Simple SparkSQL querying
● Data Frames
● Data exploration with SparkSQL
● Connect from BI
33. New in Spark 2.0
● Unifying DataFrames and Datasets in Scala/Java (compile time
syntax and analysis errors). Same performance and convertible.
● SparkSession: a new entry point that supersedes SQLContext and
HiveContext.
● Machine learning pipeline persistence
● Distributed algorithms in R
● Faster Optimizer
● Structured Streaming
34. New in Spark 2.0
spark = SparkSession
.builder()
.appName("StructuredNetworkWordCount")
.getOrCreate()
# Create DataFrame representing the stream of input lines from connection to localhost:9999
lines = spark
.readStream
.format('socket')
.option('host', 'localhost')
.option('port', 9999)
.load()
# Split the lines into words
words = lines.select(
explode(
split(lines.value, ' ')
).alias('word')
)
# Generate running word count
wordCounts = words.groupBy('word').count()
# Start running the query that prints the running counts to the console
query = wordCounts
.writeStream
.outputMode('complete')
.format('console')
.start()
query.awaitTermination()
windowedCounts = words.groupBy(
window(words.timestamp, '10 minutes', '5 minutes'),
words.word
).count()
37. HDInsight benefits
● Ease of creating clusters (Azure portal, PowerShell, .Net SDK)
●
Ease of use (noteboks, azure control panels)
●
REST APIs (Livy: job server)
●
Support for Azure Data Lake Store (adl://)
●
Integration with Azure services (EventHub, Kafka)
●
Support for R Server (HDInsight R over Spark)
● Integration with IntelliJ IDEA (Plugin, create and submit apps)
● Concurrent Queries (many users and connections)
● Caching on SSDs (SSD as persist method)
● Integration with BI Tools (connectors for PowerBI and Tableau)
● Pre-loaded Anaconda libraries (200 libraries for ML)
● Scalability (change number of nodes and start/stop cluster)
● 24/7 Support (99% up-time)
38. HDInsight Spark Scenarious
1. Streaming data, IoT and real-time analytics
2. Visual data exploration and interactive analysis (HDFS)
3. Spark with NoSQL (HBase and Azure DocumentDB)
4. Spark with Data Lake
5. Spark with SQL Data Warehouse
6. Machine Learning using R Server, Mllib
7. Putting it all together in a notebook experience
8. Using Excel with Spark