Agile Big Data Analytics Development: An Architecture-Centric Approach - SoftServe
Presented at The Hawaii International Conference on System Sciences by Hong-Mei Chen and Rick Kazman (University of Hawaii), Serge Haziyev (SoftServe).
Video and slides synchronized; mp3 and slide download available at https://bit.ly/2OUz6dt.
Chris Riccomini talks about the current state-of-the-art in data pipelines and data warehousing, and shares some of the solutions to current problems dealing with data streaming and warehousing. Filmed at qconsf.com.
Chris Riccomini works as a Software Engineer at WePay.
There are patterns for things such as domain-driven design, enterprise architectures, continuous delivery, microservices, and many others.
But where are the data science and data engineering patterns?
Sometimes, data engineering reminds me of cowboy coding: lots of workarounds, immature technologies, and a lack of established best practices.
The right architecture is key for any IT project, and especially for big data projects, where no standard architectures have yet proven their suitability over the years. This session discusses the different big data architectures that have evolved over time, including traditional Big Data architecture, Streaming Analytics architecture, and the Lambda and Kappa architectures, and presents a mapping of components from both the open source and Oracle stacks onto these architectures.
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Building an Event-oriented... - Data Con LA
While we frequently talk about how to build interesting products on top of machine and event data, the reality is that collecting, organizing, providing access to, and managing this data is where most people get stuck. In this session, we’ll follow the flow of data through an end to end system built to handle tens of terabytes per day of event-oriented data, providing real time streaming, in-memory, SQL, and batch access to this data. We’ll go into detail on how open source systems such as Hadoop, Kafka, Solr, and Impala/Hive are actually stitched together; describe how and where to perform data transformation and aggregation; provide a simple and pragmatic way of managing event metadata; and talk about how applications built on top of this platform get access to data and extend its functionality. This session is especially recommended for data infrastructure engineers and architects planning, building, or maintaining similar systems.
As part of this session, I will give an introduction to Data Engineering and Big Data, covering up-to-date trends.
* Introduction to Data Engineering
* Role of Big Data in Data Engineering
* Key Skills related to Data Engineering
* Overview of Data Engineering Certifications
* Free Content and ITVersity Paid Resources
Don't worry if you miss the live session - you can use the link below to watch the video afterward.
https://youtu.be/dj565kgP1Ss
* Upcoming Live Session - Overview of Big Data Certifications (Spark Based) - https://www.meetup.com/itversityin/events/271739702/
Relevant Playlists:
* Apache Spark using Python for Certifications - https://www.youtube.com/playlist?list=PLf0swTFhTI8rMmW7GZv1-z4iu_-TAv3bi
* Free Data Engineering Bootcamp - https://www.youtube.com/playlist?list=PLf0swTFhTI8pBe2Vr2neQV7shh9Rus8rl
* Join our Meetup group - https://www.meetup.com/itversityin/
* Enroll for our labs - https://labs.itversity.com/plans
* Subscribe to our YouTube Channel for Videos - http://youtube.com/itversityin/?sub_confirmation=1
* Access Content via our GitHub - https://github.com/dgadiraju/itversity-books
* Lab and Content Support using Slack
What’s New with Databricks Machine Learning - Databricks
In this session, the Databricks product team provides a deeper dive into the machine learning announcements. Join us for a detailed demo that gives you insights into the latest innovations that simplify the ML lifecycle — from preparing data, discovering features, and training and managing models in production.
Big Data Day LA 2016/ Use Case Driven track - Shaping the Role of Data Scienc... - Data Con LA
At IRIS.TV, we build algorithmic solutions for video recommendation, with the end goal of delivering a great user experience, as evidenced by users viewing more video content. This talk outlines our reasons for expanding from a descriptive/predictive approach to data analytics toward a philosophy that features more prescriptive analytics, driven by our data science team.
In this slidedeck, Infochimps Director of Product, Tim Gasper, discusses how Infochimps tackles business problems for customers by deploying a comprehensive Big Data infrastructure in days; sometimes in just hours. Tim unlocks how Infochimps is now taking that same aggressive approach to deliver faster time to value by helping customers develop analytic applications with impeccable speed.
Many organizations focus on the licensing cost of Hadoop when considering migrating to a cloud platform. But other costs should be considered, as well as the biggest impact, which is the benefit of having a modern analytics platform that can handle all of your use cases. This session will cover lessons learned in assisting hundreds of companies to migrate from Hadoop to Databricks.
Exploring Data Preparation and Visualization Tools for Urban Forestry - Azavea
This webinar was held on December 12, 2012 and provided an overview of free and low-cost tools for cleaning and preparing data and building useful and beautiful data visualizations.
Reinventing the Modern Information Pipeline: Paxata and MapR - Lilia Gutnik
(Presented at MapR's Big Data Everywhere event in Redwood City, CA in December 2016)
The relationship between business teams and IT has changed as the complexity of data has increased. A traditional data pipeline designed for an IT-centered approach to information management is not designed for the data demands of today's business decisions. Designing a big data strategy requires modernizing previous approaches. Self-service data preparation in a collaborative, intuitive, governed, and secure environment is the key to a nimble and decisive business unit.
Netflix Data Engineering @ Uber Engineering Meetup - Blake Irvine
People, Platform, Projects: these slides overview how Netflix works with Big Data. I share how our teams are organized, the roles we typically have on the teams, an overview of our Big Data Platform, and two example projects.
Data Preparation vs. Inline Data Wrangling in Data Science and Machine Learning - Kai Wähner
A comparison of data preparation vs. data wrangling programming languages, frameworks, and tools in machine learning / deep learning projects.
A key task in creating appropriate analytic models in machine learning or deep learning is the integration and preparation of data sets from various sources such as files, databases, big data stores, sensors, or social networks. This step can take up to 80% of the whole project.
This session compares different alternative techniques to prepare data, including extract-transform-load (ETL) batch processing (like Talend, Pentaho), streaming analytics ingestion (like Apache Storm, Flink, Apex, TIBCO StreamBase, IBM Streams, Software AG Apama), and data wrangling (DataWrangler, Trifacta) within visual analytics. Various options and their trade-offs are shown in live demos using different advanced analytics technologies and open source frameworks such as R, Python, Apache Hadoop, Spark, KNIME or RapidMiner. The session also discusses how this is related to visual analytics tools (like TIBCO Spotfire), and best practices for how the data scientist and business user should work together to build good analytic models.
Key takeaways for the audience:
- Learn various options for preparing data sets to build analytic models
- Understand the pros and cons and the targeted persona for each option
- See different technologies and open source frameworks for data preparation
- Understand the relation to visual analytics and streaming analytics, and how these concepts are actually leveraged to build the analytic model after data preparation
Video Recording / Screencast of this Slide Deck: https://youtu.be/2MR5UynQocs
After the construction of several data lakes and large business intelligence pipelines, we now know that the use of Scala and its principles was essential to the success of those large undertakings.
In this talk, we will go through the 7 key Scala-based architectures and methodologies that were used in real-life projects. More specifically, we will see the impact of these recipes on Spark performance, and how they enabled the rapid growth of those projects.
* Validate data
* Questionnaire checking
* Edit acceptable questionnaires
* Code the questionnaires
* Keypunch the data
* Clean the data set
* Statistically adjust the data
* Store the data set for analysis
* Analyse data
This presentation provides an introduction to Azure DocumentDB. Topics include elastic scale, global distribution, and guaranteed low latencies (with SLAs) - all in a managed document store that you can query using SQL and JavaScript. We also review common scenarios and advanced data science scenarios.
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ... - Precisely
Tackling the challenge of designing a machine learning model and putting it into production is the key to getting value back – and the roadblock that stops many promising machine learning projects. After the data scientists have done their part, engineering robust production data pipelines has its own set of challenges. Syncsort software helps the data engineer every step of the way.
Building on the process of finding and matching duplicates to resolve entities, the next step is to set up a continuous streaming flow of data from data sources so that as the sources change, new data automatically gets pushed through the same transformation and cleansing data flow – into the arms of machine learning models.
Some of your sources may already be streaming, but the rest are sitting in transactional databases that change hundreds or thousands of times a day. The challenge is that you can’t affect the performance of data sources that run key applications, so putting something like database triggers in place is not the best idea. Using Apache Kafka or similar technologies as the backbone for moving data around doesn’t by itself solve the problem: you still need to grab changes from the source, push them into Kafka, and consume the data from Kafka for processing. If something unexpected happens - like connectivity being lost on either the source or the target side - you don’t want to have to fix it or start over because the data is out of sync.
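A minimal sketch of the consume-and-commit side of that problem, assuming the kafka-python client and hypothetical topic/broker names (this is not Syncsort's implementation): offsets are committed manually only after a record has been processed, so a dropped connection on either side resumes from the last synced position rather than losing data or starting over.

```python
# Hedged sketch: consuming CDC events from Kafka with manual offset commits,
# so a lost connection resumes from the last processed record.
# Topic name, broker address, and group id are hypothetical.
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "orders.cdc",                       # hypothetical CDC topic
    bootstrap_servers="localhost:9092",
    group_id="ml-feature-pipeline",
    enable_auto_commit=False,           # commit only after successful processing
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for record in consumer:
    change = record.value               # one row-level change from the source DB
    # ... transform/cleanse and feed the ML model or feature store here ...
    consumer.commit()                   # offsets advance only on success
```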
View this 15-minute webcast on-demand to learn how to tackle these challenges in large scale production implementations.
Get real-time analytics for device/application/business monitoring from trillions of events and petabytes of data, as companies like Netflix, Uber, Alibaba, PayPal, eBay, and Metamarkets do.
Building a scalable analytics environment to support diverse workloads - Alluxio, Inc.
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
Building a scalable analytics environment to support diverse workloads
Tom Panozzo, Chief Technology Officer (Aunalytics)
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
DoneDeal's AWS Data Analytics Platform was built using AWS products: EMR, Data Pipeline, S3, Kinesis, Redshift, and Tableau. The custom ETL was written using PySpark.
A data lake can be used as a source for both structured and unstructured data - but how? We'll look at using open standards including Spark and Presto with Amazon EMR, Amazon Redshift Spectrum and Amazon Athena to process and understand data.
Speakers:
Neel Mitra - Solutions Architect, AWS
Roger Dahlstrom - Solutions Architect, AWS
In this presentation, you will get a look under the covers of Amazon Redshift, a fast, fully-managed, petabyte-scale data warehouse service for less than $1,000 per TB per year. Learn how Amazon Redshift uses columnar technology, optimized hardware, and massively parallel processing to deliver fast query performance on data sets ranging in size from hundreds of gigabytes to a petabyte or more. We'll also walk through techniques for optimizing performance, and you'll hear from a specific customer about their use case, taking advantage of fast performance on enormous datasets and leveraging economies of scale on the AWS platform.
Speakers:
Ian Meyers, AWS Solutions Architect
Toby Moore, Chief Technology Officer, Space Ape
Data Lake and the rise of the microservices - Bigstep
By simply looking at structured and unstructured data, Data Lakes enable companies to understand correlations between existing and new external data - such as social media - in ways traditional Business Intelligence tools cannot.
For this you need to find out the most efficient way to store and access structured or unstructured petabyte-sized data across your entire infrastructure.
In this meetup we’ll answer the following questions:
1. Why would someone use a Data Lake?
2. Is it hard to build a Data Lake?
3. What are the main features that a Data Lake should bring in?
4. What’s the role of the microservices in the big data world?
Off-Label Data Mesh: A Prescription for Healthier Data - Hosted by Confluent
"Data mesh is a relatively recent architectural innovation, espoused as one of the best ways to fix analytic data. We renegotiate aged social conventions by focusing on treating data as a product, with a clearly defined data product owner, akin to that of any other product. In addition, we focus on building out a self-service platform with integrated governance, letting consumers safely access and use the data they need to solve their business problems.
Data mesh is prescribed as a solution for _analytical data_, so that conventionally analytical results (think weekly sales or monthly revenue reports) can be more accurately and predictably computed. But what about non-analytical business operations? Would they not also benefit from data products backed by self-service capabilities and dedicated owners? If you've ever provided a customer with an analytical report that differed from their operational conclusions, then this talk is for you.
Adam discusses the resounding successes he has seen from applying data mesh _off-label_ to both analytical and operational domains. The key? Event streams. Well-defined, incrementally updating data products that can power both real-time and batch-based applications, providing a single source of data for a wide variety of application and analytical use cases. Adam digs into the common areas of success seen across numerous clients and customers and provides you with a set of practical guidelines for implementing your own minimally viable data mesh.
Finally, Adam covers the main social and technical hurdles that you'll encounter as you implement your own data mesh. Learn about important data use cases, data domain modeling techniques, self-service platforms, and building an iteratively successful data mesh.
Things you didn't know you could do in Spark - SnappyData
This presentation discusses issues with the modern lambda architecture and how Spark attempts to solve them with structured streaming and interactive querying. It then shows how SnappyData takes these solutions one step further with its Synopsis Data Engine.
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming - Paco Nathan
London Spark Meetup 2014-11-11 @Skimlinks
http://www.meetup.com/Spark-London/events/217362972/
To paraphrase the immortal crooner Don Ho: "Tiny Batches, in the wine, make me happy, make me feel fine." http://youtu.be/mlCiDEXuxxA
Apache Spark provides support for streaming use cases, such as real-time analytics on log files, by leveraging a model called discretized streams (D-Streams). These "micro batch" computations operate on small time intervals, generally from 500 milliseconds up. One major innovation of Spark Streaming is that it leverages a unified engine; in other words, the same business logic can be used across multiple use cases: streaming, but also interactive, iterative, machine learning, etc.
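To make the micro-batch model concrete, here is a minimal sketch using the pyspark.streaming D-Streams API the talk refers to; the socket source and the one-second batch interval are illustrative assumptions, not details from the talk.

```python
# Minimal D-Streams sketch: word counts over 1-second micro-batches.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="MicroBatchWordCount")
ssc = StreamingContext(sc, 1)                    # 1-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)  # assumes a local test source
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                  # print each batch's counts

ssc.start()
ssc.awaitTermination()
```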
This talk will compare case studies for production deployments of Spark Streaming, emerging design patterns for integration with popular complementary OSS frameworks, plus some of the more advanced features such as approximation algorithms, and take a look at what's ahead — including the new Python support for Spark Streaming that will be in the upcoming 1.2 release.
Also, let's chat a bit about the new Databricks + O'Reilly developer certification for Apache Spark…
Accelerating TensorFlow with RDMA for high-performance deep learning - DataWorks Summit
Google’s TensorFlow is one of the most popular deep learning (DL) frameworks. In distributed TensorFlow, gradient updates are a critical step governing the total model training time. These updates incur a massive volume of data transfer over the network.
In this talk, we first present a thorough analysis of the communication patterns in distributed TensorFlow. Then we propose a unified way of achieving high performance through enhancing the gRPC runtime with Remote Direct Memory Access (RDMA) technology on InfiniBand and RoCE. Through our proposed RDMA-gRPC design, TensorFlow only needs to run over the gRPC channel and gets the optimal performance. Our design includes advanced features such as message pipelining, message coalescing, zero-copy transmission, etc. The performance evaluations show that our proposed design can significantly speed up gRPC throughput by up to 1.5x compared to the default gRPC design. By integrating our RDMA-gRPC with TensorFlow, we are able to achieve up to 35% performance improvement for TensorFlow training with CNN models.
Speakers
Dhabaleswar K (DK) Panda, Professor and University Distinguished Scholar, The Ohio State University
Xiaoyi Lu, Research Scientist, The Ohio State University
Idera live 2021: Managing Databases in the Cloud - the First Step, a Succes... - IDERA Software
You need to start moving some on-premises databases to the cloud.
- Where do you begin?
- What are your options?
- What will your job look like afterward?
- What tools can you use to manage databases in the cloud?
- How does troubleshooting database performance problems in the cloud differ from on-premises?
- How can you help manage monthly cloud costs so the effort actually is cost effective?
Moving to the cloud is not as easy as one might think. So, knowing the answers to these kinds of questions will place your feet on the path to success. See how DB PowerStudio can readily assist with these efforts and questions.
The presenter, Bert Scalzo, is an Oracle ACE, blogger, author, speaker and database technology consultant. He has worked with all major relational databases, including Oracle, SQL Server, Db2, Sybase, MySQL, and PostgreSQL. Bert’s work experience includes stints as product manager for multiple-database tools, such as DBArtisan and Aqua Data Studio at IDERA. He has three decades of Oracle database experience and previously worked for both Oracle Education and Oracle Consulting. Bert holds several Oracle Masters certifications and his academic credentials include a BS, MS, and PhD in computer science, as well as an MBA.
Similar to Essential Data Engineering for Data Scientist (20)
You have a roadmap for bringing your next digital innovation to market, designed to transform your business model for the future. But if the first stop on the development journey isn’t incorporating plans for assessing the quality of the end product, evaluating your overall processes, and tracking the ongoing health of your development, then you are sure to discover many costly pitfalls along the way.
During this free on-demand webinar, you will learn:
What is digital transformation and why should we care about it? (Part 1)
How to change in the digital era while preserving quality? (Part 1)
4 main steps that will help you improve quality while going digital (Part 2)
Whether you're a huge enterprise or a small start-up, you can't escape global digitalization. As digital technologies like machine-2-machine communication, device-2-device telematics, connected cars, and the Internet of Things become more integral in today’s world, more threats will appear as hackers use new ways to exploit weaknesses in your organization and products.
During SoftServe’s free security webinar, Nazar Tymoshyk will explore the reasons why recent victims of digital attacks couldn’t withstand a threat to their security and share how you can build secure and compliant software with the help of security experts. A real-life case study will demonstrate how SoftServe assessed and mitigated security threats for a top organization.
How to Reduce Time to Market Using Microsoft DevOps Solutions - SoftServe
The Microsoft DevOps toolset replaces error-prone manual processes with automation for improved traceability and repeatable workflows.
Learn more about:
- The benefits of Continuous Integration practice
- Continuous Deployment as an accelerator to deliver high quality software
- How to use Visual Studio Team Services and Microsoft Azure to decrease rework and increase team productivity
We are now witnessing a new wave of IT revolution and its effect is very similar to the Cloud and Virtualization revolutions that started in the last decade. This new wave, called Containerization, is related to technologies such as Docker and Kubernetes, which now fuel large scale solutions including Big Data and IoT.
Learn about:
- Typical DevOps challenges and modern solutions
- Using Docker as Amazon EC2 Container Service
- Evolution of Enterprise Architecture (Containers, IoT, Machine Learning, and technologies of tomorrow)
- Business value of using advanced DevOps technologies, with a real-life case study
Advanced Analytics and Data Science Expertise - SoftServe
An overview of SoftServe's Data Science service line.
- Data Science Group
- Data Science Offerings for Business
- Machine Learning Overview
- AI & Deep Learning Case Studies
- Big Data & Analytics Case Studies
Visit our website to learn more: http://www.softserveinc.com/en-us/
Big Data as a Service: A Neo-Metropolis Model Approach for Innovation - SoftServe
Presented at The Hawaii International Conference on System Sciences by Hong-Mei Chen and Rick Kazman (University of Hawaii), Serge Haziyev and Valentyn Kropov (SoftServe), Dmitri Chtchoutov.
Personalized Medicine in a Contemporary World by Eugene Borukhovich, SVP Heal... - SoftServe
• Latest advances in the personalized medicine market
• Impact and trends around consumerism and big data
• How technology is driving digital health forward
How to Implement Hybrid Cloud Solutions Successfully - SoftServe
There is a vast range of new technological trends appearing on the market, among them Hybrid Cloud. According to a recent Gartner report, Computing Innovations That Organizations Should Monitor 2015, the “Cloud” trend has been replaced by “Hybrid Cloud” - but what exactly is this new trend?
Opendatabay - Open Data Marketplace.pptx - Opendatabay
Opendatabay.com unlocks the power of data for everyone. The Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
It is the first open hub for data enthusiasts to collaborate and innovate: a platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, Opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. It leverages cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex: Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay also breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits, Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay: the marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Techniques to optimize the PageRank algorithm usually fall into two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, which have the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance; the final ranks of chain nodes can be easily calculated. This could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
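As a rough illustration of the first of those techniques, here is a toy Python sketch (not the STICD implementation) of power-iteration PageRank that freezes a vertex once its rank change falls below a tolerance, skipping it in later iterations:

```python
# Toy sketch: PageRank that skips vertices which have already converged.
def pagerank_skip_converged(out_links, d=0.85, tol=1e-10, max_iter=100):
    n = len(out_links)
    rank = {v: 1.0 / n for v in out_links}
    in_links = {v: [] for v in out_links}
    for u, outs in out_links.items():
        for v in outs:
            in_links[v].append(u)
    converged = set()
    for _ in range(max_iter):
        new_rank = dict(rank)
        for v in out_links:
            if v in converged:
                continue                   # skip already-converged vertices
            r = (1 - d) / n + d * sum(
                rank[u] / len(out_links[u]) for u in in_links[v]
            )
            if abs(r - rank[v]) < tol:
                converged.add(v)           # freeze this vertex from now on
            new_rank[v] = r
        rank = new_rank
        if len(converged) == n:
            break
    return rank

print(pagerank_skip_converged({0: [1], 1: [0, 2], 2: [0]}))
```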
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues (a minimal sketch follows this list).
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
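To make the validation point above concrete, here is a minimal, purely illustrative sketch of a source-side data-quality check; the field names and rules are hypothetical:

```python
# Illustrative source-side data-quality check; rules are hypothetical.
def validate(record):
    errors = []
    if not record.get("id"):
        errors.append("missing id")
    if record.get("amount", 0) < 0:
        errors.append("negative amount")
    return errors

rows = [{"id": "a1", "amount": 10.0}, {"id": "", "amount": -3.0}]
for row in rows:
    problems = validate(row)
    if problems:
        print(f"rejected {row}: {', '.join(problems)}")  # route to quarantine
```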
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... - John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. For more details, visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
As Europe's leading economic powerhouse and the fourth-largest #economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like #Russia and #China, #Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in #cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to #AdvancedPersistentThreats (#APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Levelwise PageRank with Loop-Based Dead End Handling Strategy: SHORT REPORT ... - Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables ranks to be calculated in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It does, however, come with a precondition: the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large number of small workload submissions and is expected to be a non-issue when the computation is performed on massive graphs.
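A minimal sketch of the decomposition the abstract describes, using networkx as a stand-in (an assumption; the report's own implementation targets CPU and GPU): condense the graph into its strongly connected components and visit them in topological order, so each level's ranks depend only on levels already processed. The per-component ranking step is elided.

```python
# Sketch: SCC condensation + topological (levelwise) processing order.
import networkx as nx

G = nx.DiGraph([(0, 1), (1, 2), (2, 0), (2, 3), (3, 4), (4, 3)])

C = nx.condensation(G)                  # DAG of strongly connected components
for comp_id in nx.topological_sort(C):
    members = C.nodes[comp_id]["members"]
    # ranks for `members` depend only on components already processed
    print(f"levelwise component {comp_id}: {sorted(members)}")
```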
2. Me, myself, and I: Valentyn Kropov
• Sr. Big Data Solutions Architect.
• 14 years of work experience with Databases.
• 4 years in Big Data.
• Big Data Consulting Lead at SoftServe (20+ Engineers and Architects).
• Founder of Kyiv Big Data Community (600+ people).
3. Agenda
1. Level of Involvement
2. Choosing the Right Tools (Distribution of Hadoop)
3. RDBMS vs. NoSQL
4. NoSQL Data Modeling
5. Deployment
6. On-Premises vs. Cloud
7. Scalability and Performance
8. Storage
6. Project Stages from Data Engineering Perspective
1. Statement of work
2. Requirements
3. Architecture
4. Infrastructure
5. Data modeling/ETL
6. Data Science modeling
7. Involvement: Checklist
1. You’re the boss!
2. You have a right to demand the infrastructure you need.
3. But you need to have solid arguments.
4. And I’ll show it to you right now.
10. Big Data Analytics Reference Architecture
A modern, integrated approach for solving Big Data / Business Analytics needs across multiple verticals and domains.
Diagram: all data sources (application data; media data such as images and video; social data; enterprise content data; machine, sensor, and log data; docs and archives data) feed into data acquisition and storing, real-time data processing (complex event processing, real-time query and search), a data lake (landing, exploration, and archiving), data integration, and enterprise data warehousing, all governed by data management (governance, security, quality, MDM). On top sit analytics (reporting and analysis, predictive modeling, data mining), UX and visualization, and applications such as customer analytics, marketing analytics, web/mobile/social analytics, IT operational analytics, and fraud and risk analytics.
11. Hortonworks vs. Cloudera vs. MapR
Feature | Hortonworks | Cloudera | MapR
File system | HDFS | HDFS | MapR FS
Non-Hadoop access | NFS | Fuse-DFS | Direct Access NFS
Data integration services | Talend | - | -
Data analysis framework | - | DataFu | -
Software abstraction layer | - | - | Apache Cascading
Web access | WebHDFS | HttpFS | -
Parallel query execution | Tez (Stinger) | Impala | -
Installation | Ambari | Cloudera Manager | -
Security | - | Sentry | -
Monitoring | Ganglia/Nagios | - | -
Non-MapReduce tasks | YARN | YARN | -
http://www.networkworld.com/article/2369327/software/comparing-the-top-hadoop-distributions.html
12. Or Even More: IBM, Oracle, Amazon, …
1. IBM: Big R (set of Data Science algorithms) and Big SQL (SQL-like interface to data).
2. Oracle: Big Data appliance/connectors.
3. Amazon: Elastic MapReduce.
13. Choosing the Right Tools: Example (Description)
Data Volume:
• 270-300 Web Servers (Apache HTTPD)
• 447,392 events per minute
• 644,245,094 events/day
• ~100-250 bytes per event
• 150GB of data per day (sanity-checked in the sketch after this slide)
Log Types:
• Apache HTTPD access log
• Apache HTTPD error log
• Service log (CPU, RAM, I/O, Disk)
• Application server servlet log
Retention:
• Last 30 days: Raw data
• Last 24 hours: per minute aggregation
• Whole period: per hour aggregation
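A quick back-of-the-envelope check of the volume figures above, assuming an average event size of ~230 bytes picked from the stated 100-250 byte range:

```python
# Sanity check of the sizing figures on this slide.
events_per_minute = 447_392
events_per_day = events_per_minute * 60 * 24
print(events_per_day)           # 644,244,480 -- matches the stated 644,245,094

bytes_per_event = 230           # assumed average within the 100-250 byte range
gb_per_day = events_per_day * bytes_per_event / 1e9
print(round(gb_per_day))        # ~148 GB, consistent with "150GB of data per day"
```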
15. Choosing the Right Tools: Example (Description - data)
Access log:
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
Error log:
[Sun Mar 7 20:58:27 2004] [info] [client 64.242.88.10] (104)Connection reset by peer: client stopped connection before send body completed
[Sun Mar 7 21:16:17 2004] [error] [client 24.70.56.49] File does not exist: /home/httpd/twiki/view/Main/WebHome
vmstat:
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 305416 260688 29160 2356920 2 2 4 1 0 0 6 1 92 2 0
iostat:
Linux 2.6.32-100.28.5.el6.x86_64 (dev-db) 07/09/2011
avg-cpu: %user %nice %system %iowait %steal %idle
5.68 0.00 0.52 2.03 0.00 91.76
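As a small illustration of the first processing step for data like this, here is a sketch that parses the access-log line above with a generic Common Log Format regular expression (the pattern is an assumption, not something shown in the webinar):

```python
# Parse one Common Log Format line into its fields.
import re

LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)'
)

line = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
m = LOG_PATTERN.match(line)
if m:
    print(m.group("host"), m.group("status"), m.group("size"))
    # -> 127.0.0.1 200 2326
```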
20. Choosing the Right Tools: Checklist
1. Fastest random access to the data: Cloudera (Impala).
2. Universal (and fast!) access to data: MapR (MapR FS).
3. Data Integration: Hortonworks (built-in TalenD).
4. Never trust papers, always double check: Proof-of-Concept.
5. Lastly, make sure the cluster is right-sized and check every element of the chain!
23. It’s Not Necessarily Always Black and White!
• Traditional-relational
• Extended-relational
• Non-relational
• Lambda architecture (Hybrid)
• Data refinery (Hybrid)
24. SoftServe Lambda Architecture Accelerator
• Lambda architecture is a highly scalable and reliable data processing architecture, based on Twitter's successful experience in Big Data and Analytics.
• Supports the majority of use cases: real-time analytics, data discovery, and business reports.
• SoftServe's pre-built Lambda architecture stack accelerates the customer's time to market (TTM) by 15-20+ man-months.
25. RDBMS vs NoSQL: Checklist
1. RDBMS: structured data, moderate velocity and volume (up to TB), with complex transactions.
2. NoSQL: unstructured data, high velocity or volume (up to PB+), with simple transactions.
3. Hybrid, Lambda, Refinery: something in-between.
27. NoSQL: How is it Different from RDBMS?
1. Write operations are cheap.
2. Fewer transactions and weaker consistency.
3. Read operations are blazingly fast!
28. NoSQL: Two Main Rules to Remember
1. Spread Data evenly around the cluster.
2. Minimize the number of partitions read.
29. RDBMS: Queries Around Model
Q1: People who live in state X.
Q2: People who live in city Y.
Q3: People who live at address Z.
30. NoSQL: Model Around Queries!
Q1: People who live in state X. Q2: People who live in city Y. Q3: People who live at address Z.
People_by_States: state (partition/primary key); country, first_name, last_name, city, street_name1, street_name2, street_number.
People_by_City: city (partition/primary key); country, first_name, last_name, state, street_name1, street_name2, street_number.
People_by_FullAddress: country, city, state, street_name1 (composite partition/primary key); first_name, last_name, street_name2, street_number.
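To make the model-around-queries idea concrete, here is a hedged sketch of one of these tables in Cassandra-style CQL via the DataStax Python driver. The cluster address, keyspace name, and the clustering columns (last_name and first_name, added so rows within a partition stay unique) are assumptions, not details from the deck.

```python
# Sketch: one table per query, keyed so each query reads a single partition.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("demo")   # assumed local cluster/keyspace

session.execute("""
    CREATE TABLE IF NOT EXISTS people_by_city (
        city          text,
        last_name     text,
        first_name    text,
        country       text,
        state         text,
        street_name1  text,
        street_name2  text,
        street_number text,
        PRIMARY KEY ((city), last_name, first_name)
    )
""")

# Q2 ("people who live in city Y") hits exactly one partition.
rows = session.execute(
    "SELECT first_name, last_name FROM people_by_city WHERE city = %s", ["Lviv"]
)
for row in rows:
    print(row.first_name, row.last_name)
```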
31. Data Modeling: Checklist
1. In NoSQL, you can have a table for each query, and that is totally OK - don't save disk space! (Sacrifice cheap writes for the fastest reads.)
2. There are (almost) no secondary indexes in NoSQL, only primary.
3. Pick the correct primary (partitioning) key so that each request reads only one partition.
33. Deployment Defined
In short, deployment is the litmus test that defines a project's level of maturity, and overall project success depends on it.
34. Deployment Stages
1. Bootstrapping: Create VMs and hosts.
2. Provisioning: Install software like Hadoop.
3. Configuration: Initial parameters and data.
4. Validation: Verify installation.
38. Service Layout & Memory Allocation
http://blog.cloudera.com/blog/2015/01/how-to-deploy-apache-hadoop-clusters-like-a-boss/
39. Automation: Checklist
1. Deployment should be fully automated (Terraform and Ansible).
2. Ensure service layout is correct (master nodes, worker nodes, and edge nodes).
3. Double check that enough memory has been allocated to the nodes (~64-128GB for master/edge nodes, ~256-512GB for data/worker nodes).
41. On-Premises
(real hardware somewhere in your building or data center)
1. Highest data privacy (Regulations and sensitive data).
2. Quickest access to data (Latency).
3. Best velocity (Transfer rates).
4. Existing Hardware.
5. Control over resource usage.
43. Hybrid
1. Hybrid: a combination of on-premises and cloud.
2. On-premises: sensitive information and data for high-performance access.
3. Cloud: non-sensitive data.
44. On-Premises vs. Cloud
1. Oracle ExaData ~ $1,000,000.
2. For the same money, the biggest instance in Amazon EC2 (40 CPUs) could run for ~50 years!
45. On-Premises vs. Cloud: Checklist
1. On-premises: if the customer has existing unused hardware, predictable data volume growth, or strong data security requirements.
2. Cloud: if the customer doesn't have a large budget, is unsure about data and load growth, and doesn't have strong security requirements or a team of engineers to support hardware.
3. Hybrid: a mixture of the requirements above.
47. Dedicated Clusters
Diagram: three separate clusters (a Visualization Service, a Data Ingestion Service, and an Analytics Service), each running on its own set of VMs.
• Configuration and management of 3 separate clusters.
• Resources stay idle if a service is not active.
• Need to move data between clusters for each service.
48. Shared Clusters
Diagram: the same services (Visualization, Data Ingestion, Analytics) consolidated from multiple dedicated clusters onto shared clusters, to maximize utilization and to share data between services.
50. Shared Clusters: Mesos/Docker
Maximize utilization and performance: deliver more services with a smaller footprint.
Shared clusters for all services: easier deployment and management with a unified service platform.
Shared data between services: faster and more competitive services and solutions.
51. How Does this Work?
Diagram: a Zookeeper quorum coordinates three Mesos Masters. Service schedulers (a Spark service scheduler and a Marathon service scheduler) register with the masters, and Mesos Slaves run the actual work: a Spark task executor and a Mesos executor on one slave, Docker executors on another, hosting tasks such as ./python XYZ, java -jar XYZ.jar, and ./xyz.
52. How Does this Work?
Mesos provides fine-grained resource isolation.
Diagram: on a compute node, the Mesos slave process runs each executor (a Spark task executor with Task #1 and Task #2, and a Mesos executor running ./python XYZ) inside its own container (cgroups).
53. How Does this Work?
Mesos provides scalability.
Diagram: when the Python executor finishes, more resources become available on the compute node and the Spark task executor picks up additional tasks (Task #3, Task #4) alongside Task #1 and Task #2.
54. How Does this Work?
Mesos has no single point of failure: services keep running if a VM fails.
Diagram: three Mesos Masters spread across five VMs (VM1-VM5), so the loss of any single VM leaves the cluster running.
55. How Does this Work?
The master node can fail over: services keep running if a Mesos Master fails.
Diagram: if one of the three Mesos Masters across VM1-VM5 fails, another master takes over.
56. How Does this Work?
The slave process can fail over: tasks keep running if the Mesos slave process fails.
Diagram: on the compute node, the Spark task executor and Tasks #1-#4 keep running while the slave process recovers.
57. Scalability & Performance: Checklist
1. If you need real scalability, use shared clusters.
2. Shared clusters are a natural fit for the cloud.
3. Scalability means performance (in most cases); treat the two as near-synonyms.
63. Storage Comparison
1. Amazon S3: universal access, cheap; data needs to be copied before processing.
2. HDFS: compatible with the Hadoop ecosystem, relatively cheap; data can be processed where it is stored.
3. Directly Attached Storage / Network Attached Storage: expensive, fastest access to data; data can also be processed where it is stored.
64. Storage: Checklist
1. If you need unified access to data, use a universal cloud FS similar to Amazon S3.
2. For immediate access to data (OLTP systems), you need Directly Attached Storage (DAS), Network Attached Storage (NAS), Elastic Block Storage (Amazon EBS), and so on.
3. If you choose NoSQL, you'll need much more space than the actual data (each query might require a duplicate copy of the data).
4. Pick storage carefully and use PoC/prototyping; otherwise, changing storage later will be hard to almost impossible.
66. Final Checklist
1. You’re the Boss!
2. You have a right to demand the infrastructure you need.
3. However, you need to have solid arguments.
4. Now you have them and know where to find the details.
5. Good luck and see you in the field!