Sharing about how Lazada overcame its challenges with scaling and with building a proper data culture, at the Big Data and Analytics Innovation Summit Singapore 2018
INSEAD Sharing on Lazada Data Science and my JourneyEugene Yan Ziyou
Sharing about how Lazada applies data science to improve customer and seller experience, and my personal journey to my current role in Lazada as Data Science Lead, VP
How Lazada ranks products to improve customer experience and conversionEugene Yan Ziyou
Slides from sharing at Strata + Hadoop Singapore 2016 (http://conferences.oreilly.com/strata/hadoop-big-data-sg/public/schedule/detail/54542)
Ecommerce has enabled retailers to make all of their products available to consumers and consumers to access niche products not found in brick-and-mortar stores. This growth provides consumers with unparalleled choice. Nonetheless, the sheer number of products brings with it the challenge of helping users find relevant products with ease.
Lazada has tens of millions of products on its platform, and this number grows by approximately one million monthly. Lazada’s challenge: How can we help users easily discover good quality products they will like? How can we ensure product selection remains fresh and constantly updated?
One way to do this is through the ranking of products. Via ranking, Lazada helps customers easily find products that will delight them by ensuring these products appear in the first few pages. I’ll share how Lazada ranks products on our website. (Note: Google “how amazon ranks products” for some industry background)
Topics include how we:
* Develop methodology (and tricks) to solve not-so-well-defined problems
* Collect and store user-behavior data from our website and app
* Clean and prepare the data (e.g., handling outliers)
* Discover and create useful features
* Build models to improve customer experience and meet business objectives
* Measure and test outcomes on our website
* Build this end-to-end on our Hadoop infrastructure, with tools including Kafka and Spark
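As a concrete flavour of the data-preparation step above, here is a hedged, stand-alone sketch of one common way to handle outliers in user-behavior counts: winsorizing at a chosen percentile. The variable names and thresholds are illustrative, not Lazada's actual pipeline.

```python
def percentile(values, p):
    """Linear-interpolated percentile of `values` (0 <= p <= 100)."""
    s = sorted(values)
    k = (len(s) - 1) * p / 100.0
    lo, hi = int(k), min(int(k) + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (k - lo)

def winsorize(values, upper_pct=99.0):
    """Cap values above the upper_pct percentile to tame extreme outliers."""
    cap = percentile(values, upper_pct)
    return [min(v, cap) for v in values]

# The last entry looks like bot-inflated traffic; winsorizing caps it.
daily_views = [3, 5, 4, 6, 2, 5, 4, 10_000]
cleaned = winsorize(daily_views, upper_pct=90.0)
```

Clipping rather than dropping keeps the row (and its other features) in the training set while bounding its influence.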
Deep Dive: Spark DataFrames, SQL and Catalyst OptimizerSachin Aggarwal
RDD recap
Spark SQL library
Architecture of Spark SQL
Comparison with Pig and Hive Pipeline
DataFrames
Definition of a DataFrames API
DataFrames Operations
DataFrames features
Data cleansing
Diagram for logical plan container
Plan Optimization & Execution
Catalyst Analyzer
Catalyst Optimizer
Generating Physical Plan
Code Generation
Extensions
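The plan-optimization steps in the outline above can be illustrated with a toy rule-based rewrite, in the spirit of (but much simpler than) Catalyst: a logical plan is a tree of operators, and an optimizer rule pattern-matches a subtree and rewrites it. The classes and rule below are made up for illustration.

```python
class Scan:
    def __init__(self, table):
        self.table = table

class Project:
    def __init__(self, child, cols):
        self.child, self.cols = child, cols

class Filter:
    def __init__(self, child, pred_col):
        self.child, self.pred_col = child, pred_col

def push_down_filter(plan):
    # Rule: Filter(Project(child)) -> Project(Filter(child)),
    # valid when the predicate's column survives the projection.
    if isinstance(plan, Filter) and isinstance(plan.child, Project):
        proj = plan.child
        if plan.pred_col in proj.cols:
            return Project(Filter(proj.child, plan.pred_col), proj.cols)
    return plan

logical = Filter(Project(Scan("events"), ["user_id", "price"]), "price")
optimized = push_down_filter(logical)
# After the rewrite the Filter sits directly on the Scan, so fewer rows
# flow upward before a physical plan is generated.
```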
These slides were presented by Mateusz Dymcyzk at our Sydney AI and deep learning meetup. The authors were Sudalai Rajkumar (SRK), Data Scientist at H2O.ai, and Nikhil Shekhar, ML Engineer at H2O.ai.
Extending Spark SQL API with Easier to Use Array Types Operations with Marek ...Databricks
Big companies typically integrate data from various heterogeneous systems when building a data lake as a single point for accessing data. To achieve this goal, technical teams often deal with data defined by complex schemas and various data formats. Spark SQL Datasets are currently compatible with data formats such as XML, Avro and Parquet, providing primitive and complex data types such as structs and arrays.
Although the Dataset API offers a rich set of functions, general manipulation of arrays and deeply nested data structures is lacking. We will demonstrate this by providing examples of data that are currently very hard to process efficiently in Spark. We designed and developed an extension of the Dataset API to allow developers to work with array and complex-type elements in a more straightforward and consistent way. The extension should help users dealing with complex, structured big data to use Apache Spark as a truly generic processing framework.
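Spark aside, the kind of operation the abstract calls hard (updating an element inside an array of structs) is easy to state in plain Python; the record shape below is made up purely to illustrate what "manipulating nested arrays" means here, and is not the talk's Spark API.

```python
# A record with an array of nested structs, as it might land from Avro/Parquet.
record = {
    "order_id": 1,
    "items": [
        {"sku": "a", "qty": 2, "price": 9.0},
        {"sku": "b", "qty": 1, "price": 5.0},
    ],
}

def apply_discount(rec, pct):
    """Return a copy of the record with every nested item's price discounted."""
    return {
        **rec,
        "items": [{**item, "price": item["price"] * (1 - pct)}
                  for item in rec["items"]],
    }

discounted = apply_discount(record, 0.10)
```

Expressing exactly this per-element update declaratively over a Dataset column is the gap the proposed extension targets.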
Data Engineer's Lunch #81: Reverse ETL Tools for Modern Data PlatformsAnant Corporation
During this lunch, we’ll review open-source reverse ETL tools to uncover how to send data back to SaaS systems.
Sign Up For Our Newsletter: http://eepurl.com/grdMkn
Join Data Engineer’s Lunch Weekly at 12 PM EST Every Monday:
https://www.meetup.com/Data-Wranglers-DC/events/
Cassandra.Link:
https://cassandra.link/
Follow Us and Reach Us At:
Anant:
https://www.anant.us/
Awesome Cassandra:
https://github.com/Anant/awesome-cassandra
Email:
solutions@anant.us
LinkedIn:
https://www.linkedin.com/company/anant/
Twitter:
https://twitter.com/anantcorp
Eventbrite:
https://www.eventbrite.com/o/anant-1072927283
Facebook:
https://www.facebook.com/AnantCorp/
Join The Anant Team:
https://www.careers.anant.us
#data #dataengineering #datagovernance
"Structured Streaming was a new streaming API introduced to Spark over 2 years ago in Spark 2.0, and was announced GA as of Spark 2.2. Databricks customers have processed over a hundred trillion rows in production using Structured Streaming. We received dozens of questions on how to best develop, monitor, test, deploy and upgrade these jobs. In this talk, we aim to share best practices around what has worked and what hasn't across our customer base.
We will tackle questions around how to plan ahead, what kind of code changes are safe for structured streaming jobs, how to architect streaming pipelines which can give you the most flexibility without sacrificing performance by using tools like Databricks Delta, how to best monitor your streaming jobs and alert if your streams are falling behind or are actually failing, as well as how to best test your code."
An in-depth presentation on the WAND top-k retrieval algorithm for efficiently finding the top-k relevant documents for a given query from the inverted index. Compares performance of WAND with naive solutions.
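As a hedged, greatly simplified sketch of the WAND idea (candidate-at-a-time rather than the full pointer-based algorithm): each query term carries an upper bound, its maximum contribution to any document, and a candidate is fully scored only if the sum of upper bounds for its matching terms can beat the current top-k threshold. Data and scores below are made up.

```python
import heapq

def wand_topk(postings, max_score, query, k):
    """postings: term -> {doc_id: score}; max_score: term -> max contribution."""
    heap = []         # min-heap of (score, doc) holding the current top-k
    threshold = 0.0   # score of the k-th best document so far
    candidates = sorted({d for t in query for d in postings.get(t, {})})
    for doc in candidates:
        terms = [t for t in query if doc in postings.get(t, {})]
        upper = sum(max_score[t] for t in terms)
        if len(heap) == k and upper <= threshold:
            continue  # WAND-style skip: this doc cannot enter the top-k
        score = sum(postings[t][doc] for t in terms)
        if len(heap) < k:
            heapq.heappush(heap, (score, doc))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc))
        if len(heap) == k:
            threshold = heap[0][0]
    return sorted(heap, reverse=True)

postings = {
    "spark": {1: 2.0, 2: 0.5, 4: 1.5},
    "sql":   {2: 1.0, 3: 0.4, 4: 1.2},
}
max_score = {t: max(d.values()) for t, d in postings.items()}
top = wand_topk(postings, max_score, ["spark", "sql"], k=2)
```

The naive alternative scores every candidate; here document 3 is skipped without scoring because its upper bound (0.4) cannot beat the running threshold.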
An introductory presentation about the current state of personalization in (Web) search for Bibliotekarforbundet's series of 'gå-hjem-møder'. Presented on May 17, 2016 at Aalborg University Copenhagen.
Predictive Analytics enables organisations to forecast future events, analyse risks and opportunities, and automate decision making processes by analysing historic data.
LuceneRDD for (Geospatial) Search and Entity Linkagezouzias
In this talk, I will present the design and implementation of LuceneRDD for Apache Spark. LuceneRDD instantiates an inverted index on each Spark executor and collects / aggregates search results from Spark executors to the Spark driver. The main motivation behind LuceneRDD is to natively extend Spark's capabilities with full-text search, geospatial search and entity linkage without requiring an external dependency of a SolrCloud or Elasticsearch cluster.
As a case study, we will show how LuceneRDD can tackle the entity linkage problem. We will demonstrate both the flexibility and efficiency of LuceneRDD for this problem. First, we will show that LuceneRDD's interface provides a highly flexible approach to entity linkage. This flexibility is due to Lucene's powerful query language, which can combine multiple full-text queries such as term, prefix, fuzzy and phrase queries. Second, we will focus on the efficiency and scalability of LuceneRDD by linking records between two relatively large datasets.
Lastly and time permitting, I will present ShapeLuceneRDD which enhances LuceneRDD with geospatial queries.
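The executor/driver split described above can be sketched in plain Python (this is a hypothetical analogy, not LuceneRDD's actual code): each "executor" holds a small inverted index over its partition of documents and searches locally, and the "driver" merges the partial results into a global top-k.

```python
import heapq
from collections import defaultdict

class PartitionIndex:
    """Stands in for the per-executor Lucene index."""
    def __init__(self, docs):
        # docs: {doc_id: text}; build a term -> doc_ids inverted index
        self.index = defaultdict(set)
        for doc_id, text in docs.items():
            for term in text.lower().split():
                self.index[term].add(doc_id)

    def search(self, term):
        # Local hits as (doc_id, score); a constant score stands in
        # for Lucene's real relevance scoring.
        return [(doc_id, 1.0) for doc_id in self.index.get(term.lower(), ())]

def driver_search(partitions, term, k=3):
    """Collect partial results from every partition, keep the global top-k."""
    merged = [hit for p in partitions for hit in p.search(term)]
    return heapq.nlargest(k, merged, key=lambda h: h[1])

partitions = [
    PartitionIndex({1: "apache spark search"}),
    PartitionIndex({2: "geospatial search", 3: "entity linkage"}),
]
results = driver_search(partitions, "search")
```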
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Edureka!
This Edureka Spark Tutorial will help you to understand all the basics of Apache Spark. This Spark tutorial is ideal for both beginners as well as professionals who want to learn or brush up Apache Spark concepts. Below are the topics covered in this tutorial:
1) Big Data Introduction
2) Batch vs Real Time Analytics
3) Why Apache Spark?
4) What is Apache Spark?
5) Using Spark with Hadoop
6) Apache Spark Features
7) Apache Spark Ecosystem
8) Demo: Earthquake Detection Using Apache Spark
Molti p-value nella stessa analisi: necessità e metodi di correzione (Livi...Francesco Cabiddu
Slides from the seventh talk of the event held on 24 May 2013:
"A more conscious Statistics for better decisions.
A day of Methodology and Statistics for the Human Sciences."
Afternoon session: Statistics in Psychology Research.
Università degli studi di Cagliari, Dipartimento di Pedagogia, Psicologia e Filosofia.
TITLE: Many p-values in the same analysis: the need for correction and methods for it.
(L. Finos)
Università di Padova
ABSTRACT:
When analysing a dataset, it is common practice to postulate multiple experimental hypotheses. Answering them requires as many tests, each with an associated p-value. This is the typical case, for example, of two experimental groups compared on more than one scale, or of more than two groups compared pairwise on the same scale. In these cases the concept of Type I error must be extended to the multidimensional setting. The most widely accepted definitions are the FamilyWise Error Rate and the False Discovery Rate. The last three decades have seen a great number of methods flourish for controlling these two Type I error rates (in the multidimensional setting). This seminar will critically present and discuss the concepts above, along with the main methods for multiplicity control, and will briefly touch on future directions.
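A minimal pure-Python sketch of the two error-rate controls named in the abstract: Bonferroni for the FamilyWise Error Rate and Benjamini-Hochberg for the False Discovery Rate. The p-values are made up.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0_i when p_i <= alpha / m: controls the FWER."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Reject the k smallest p-values, where k is the largest rank i with
    p_(i) <= (i / m) * alpha: controls the FDR (step-up procedure)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k = rank
    rejected = [False] * m
    for i in order[:k]:
        rejected[i] = True
    return rejected

ps = [0.001, 0.008, 0.022, 0.041, 0.60]
fwer = bonferroni(ps)           # conservative: alpha / m = 0.01
fdr = benjamini_hochberg(ps)    # less conservative, rejects one more here
```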
Learning to rank (LTR) for information retrieval (IR) involves the application of machine learning models to rank artifacts, such as items to be recommended, in response to a user's need. LTR models typically employ training data, such as human relevance labels and click data, to discriminatively train towards an IR objective. The focus of this tutorial will be on the fundamentals of neural networks and their applications to learning to rank.
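To make the pairwise flavour of LTR concrete, here is a hedged, hand-rolled sketch (not the tutorial's neural models): a tiny linear scorer trained with a hinge loss so that, for each (relevant, non-relevant) document pair, the relevant one scores higher. Feature names and data are invented.

```python
def score(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def train_pairwise(pairs, n_features, lr=0.1, epochs=50):
    """pairs: list of (x_pos, x_neg); hinge loss max(0, 1 - (s_pos - s_neg))."""
    w = [0.0] * n_features
    for _ in range(epochs):
        for x_pos, x_neg in pairs:
            margin = score(w, x_pos) - score(w, x_neg)
            if margin < 1.0:  # pair mis-ordered or within the margin: update
                for j in range(n_features):
                    w[j] += lr * (x_pos[j] - x_neg[j])
    return w

# Toy features: [click_rate, position_bias] (illustrative names)
pairs = [([3.0, 0.2], [1.0, 0.9]), ([2.5, 0.1], [0.5, 0.8])]
w = train_pairwise(pairs, n_features=2)
```

A neural LTR model replaces the linear `score` with a network and the hinge with a loss like RankNet's, but the pairwise training signal is the same idea.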
A simple presentation about different big data stream-processing systems such as Spark, Samza, and Storm, the differences between their architectures and purposes, and streaming-layer tools such as Kafka and RabbitMQ. The presentation draws on this paper:
https://vsis-www.informatik.uni-hamburg.de/getDoc.php/publications/561/Real-time%20stream%20processing%20for%20Big%20Data.pdf and other useful links.
Feature engineering: the underdog of machine learning. This deck provides an overview of feature-generation methods for text, image, and audio, feature cleaning and transformation methods, how well they work, and why.
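Two of the simplest ideas in that space can be sketched in a few lines (illustrative only, not the deck's exact methods): bag-of-words counts for text, and a log transform to compress a heavy-tailed numeric feature.

```python
import math
from collections import Counter

def bag_of_words(texts):
    """Turn raw texts into count vectors over a shared, sorted vocabulary."""
    vocab = sorted({w for t in texts for w in t.lower().split()})
    vectors = []
    for t in texts:
        counts = Counter(t.lower().split())
        vectors.append([counts[w] for w in vocab])
    return vocab, vectors

def log1p_feature(values):
    """Compress a heavy-tailed feature (e.g. view counts) with log(1 + x)."""
    return [math.log1p(v) for v in values]

vocab, X = bag_of_words(["cheap phone case", "phone charger"])
views_feature = log1p_feature([3, 40, 50_000])
```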
ANI | Business Agility Day @Gurugram | Are you a responsible Business | Dilje...AgileNetwork
Abstract:
In today's ever-changing environment, every business needs to deliver to its customers and stakeholders. Customers want the best value for money and convenience, and stakeholders want a return on their investment. Unless a business recognises this and is ready to embrace the changes needed, it will not be successful. That demands the utmost ability to respond to changes in the environment; it demands business agility.
Key Takeaways:
1. Why business agility is crucial in today's environment
2. How to be agile as a business
3. The role of technology in this agility
4. Common principles between Business Agility and Software Agility
5. Ingredients of business agility
Business and IT alignment through effective Project & Program Portfolio Manag...Alan Kan
Business and IT alignment through effective Project & Program Portfolio Management.
Presented at IBM Innovate 2011 in Sydney and Melbourne in Australia in July 2011.
Hypothesis-Driven Development & How to Fail-Fast Hacking GrowthPrabhat Gupta
Why startups need to fail-fast & experiment more
Framework for Faster Experimentation
Prioritizing the right, focused experiments
Setting up team, process, culture & KPIs
Tools and a holistic architecture enabling quick implementation by a lean team of frontend & backend developers
The Right Data Warehouse: Automation Now, Business Value ThereafterInside Analysis
The Briefing Room with Dr. Robin Bloor and WhereScape
Live Webcast on April 1, 2014
Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=7b23b14b532bd7be60a70f6bd5209f03
In the Big Data shuffle, everyone is looking at Hadoop as “the answer” for collecting interesting data from a new set of sources. While Hadoop has given organizations the power to gather more information assets than ever before, the question still looms: which data, regardless of source, structure, volume and all the rest, are significant for affecting business value, and how do we harness them? One effective approach is to bolster the data warehouse environment with a solution capable of integrating all the data sources, including Hadoop, and automating delivery of key information into the right hands.
Register for this episode of The Briefing Room to hear veteran Analyst Robin Bloor as he explains how a rapidly changing information landscape impacts data management. He will be briefed by Mark Budzinski of WhereScape, who will tout his company’s data warehouse automation solutions. Budzinski will discuss how automation can be the cornerstone for closing the gap between those responsible for data management and the people driving business decisions.
Visit InsideAnalysis.com for more information.
The product development cycle for startups - everything from coming up with an idea, to validating it, building it, launching it, and measuring how well the thing you built performed against your hypothesis!
Future directives in ERP, ERP and Internet, critical success and failure factorsVarun Luthra
This ppt explains Future Directives in ERP, ERP and Internet, its critical success and failure factors, Hit 'Like' button if the ppt turns out to be useful for you in any way. Enjoy :)
Continuous auditing and monitoring (“continuous reviews”) has been discussed for decades but, based on recent surveys, implemented only in moderation. It comes down to how deeply data analytics are integrated into our audit processes in the first place, so that they can then become continuous. If a high degree of integration exists, then a good amount of continuous review is probably already happening in the organization.
However, most companies fall into the other camp and have not integrated analytics well enough or considered how to take full advantage of continuous reviews.
This course will explain culturally what audit departments must do to embrace continuous reviews and how that can be integrated with ACL Desktop software techniques. Sample files and scripts will be provided to get you started down the road to continuous reviews.
As regulatory changes sweep the globe, auditors, risk management, and compliance professionals are using more sophisticated tools and methods.
Using a live/video training library approach, we help companies of all sizes use audit and assurance software to improve business intelligence, increase efficiencies, identify fraud, test controls, and deliver bottom-line savings.
AuditNet and Cash Recovery Partners Webinar recording available at auditsoftwarevideos.com and AuditNet.tv (registration required) Recording free to view.
Sample Data Files for All Courses are available for $49
To purchase access to all sample data files, Excel macros and ACL scripts associated with the free training visit AuditSoftwareVideos.
By focusing on organizational enablers and robust software engineering practices, e-commerce companies can shorten the development lifecycle, outmaneuver the competition and remain relevant in the eyes of customers.
What is OLAP -Data Warehouse Concepts - IT Online Training @ NewyorksysNEWYORKSYS-IT SOLUTIONS
NEWYORKSYSTRAINING is dedicated to offering quality IT online training and comprehensive IT consulting services with a complete business-service delivery orientation.
[DSC Europe 22] The Making of a Data Organization - Denys HolovatyiDataScienceConferenc1
Data teams often struggle to deliver value. KPIs, data pipelines, or ML-driven predictions aren't inherently useful unless the data team enables the business to use them. Having worked on 37 data projects over the past 5 years, with total client revenue clocking in at about $350B, I started noticing simple success factors and summarized those in the Operating Model Canvas & the Value Delivery Process. With those, I branched out into what I call data organization consulting and help clients build their data teams for success, the kind you see not only on paper but also in your P&L. In this talk, I'll share some insights with you.
Similar to Data Science Challenges and Impact at Lazada (Big Data and Analytics Innovation Summit Singapore 2018)
Recommender Systems: Beyond the user-item matrixEugene Yan Ziyou
Recommendation systems. They're a pretty old topic that started way back in the 1990s.
A meetup on it sounds like it'll be boring... if we only talked about the standard user-item matrix collaborative filtering on big data systems.
Thankfully, for this meetup, we'll be sharing how we can adopt some more recent techniques to recommend products, including social media graphs (and random walks), sequences (and NLP), and PyTorch. The sharing will cover everything from data acquisition and preparation through the implementation of multiple techniques to a comparison of results. Some familiarity with Python and PyTorch would be useful; minimal math required.
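As a hedged toy sketch of the random-walk idea mentioned above (purely illustrative, not the meetup's implementation): on a user-item bipartite graph, short random walks starting from an item tend to land on related items, and visit counts serve as recommendation scores.

```python
import random
from collections import Counter, defaultdict

def build_graph(interactions):
    """interactions: (user, item) pairs -> bipartite adjacency lists."""
    user_items, item_users = defaultdict(list), defaultdict(list)
    for u, i in interactions:
        user_items[u].append(i)
        item_users[i].append(u)
    return user_items, item_users

def random_walk_recs(start_item, user_items, item_users, walks=200, rng=None):
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    visits = Counter()
    for _ in range(walks):
        item = start_item
        for _ in range(2):  # two (item -> user -> item) hops
            user = rng.choice(item_users[item])
            item = rng.choice(user_items[user])
        if item != start_item:
            visits[item] += 1
    return [i for i, _ in visits.most_common()]

interactions = [("u1", "phone"), ("u1", "case"), ("u2", "phone"),
                ("u2", "charger"), ("u3", "case")]
recs = random_walk_recs("phone", *build_graph(interactions))
```

Items co-purchased with "phone" surface first; production systems replace the counts with learned embeddings, but the graph intuition is the same.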
Healthcare expenditure is set to rise over the coming years. Cost will undoubtedly influence patients’ decision-making when it comes to diagnosis and treatment.
For healthcare providers, providing up-front cost estimates improves patient experience, making patients more willing to return (if required) in the future. For patients, having accurate pre-admission estimates allows for informed decisions and adequate preparation, reducing payment challenges after treatment. Ultimately, this case is a first step towards (i) standardization of healthcare cost estimation and (ii) price transparency to build trust between healthcare providers, payers, and patients.
OLX Group Prod Tech 2019 Keynote: Asia's Tech GiantsEugene Yan Ziyou
- Scaling across multiple properties while centralising capabilities
- How to decide what to centralise / decentralise?
- Alibaba & Grab: How do they scale across multiple commerce sites?
- SuperApps in China and Southeast Asia
- Why / why not go the SuperApp approach?
- WeChat & Grab: SuperApps of Asia
- Case Study: Alibaba’s playbook for integrating acquisitions (Lazada and Daraz)
- What were the key tactics and priorities?
- Lessons learnt
Here are the values and culture Lazada Data Science lives by daily to fulfil our mission of using data to serve our buyers, sellers, and Lazadians. If this appeals to you, reach out to me!
Sharing about my data science journey and what I do at LazadaEugene Yan Ziyou
Was invited to share with the SMU Masters of IT in Business students on (i) how I got to my current position as a data scientist and (ii) what I do in my current position.
Includes suggested areas to focus on (e.g., distributed systems and processing) and how to gain more experience (e.g., volunteering). I also go through the problems that we solve at Lazada using machine learning and a high-level architecture of how we do it.
Garuda Robotics x DataScience SG Meetup (Sep 2015)Eugene Yan Ziyou
What exactly goes on in the commercial drone/UAV industry in Singapore and globally? Behind the hype of consumer “selfie” drones lies a vast number of interesting commercial applications, where drones become an enabler for enterprises to gain new aerial perspectives of their facilities and estates, to make intelligent decisions incorporating this additional dimension of data.
In this presentation, we will look at one such drones-at-work application to reveal some of the behind-the-scene processes and technologies employed. Specifically, we will dive into the precision agriculture domain and share some of the computer vision problems we face, and take a look at various potential solutions to these challenges.
DataKind SG sharing on our first DataDive with Humanitarian Organization for Migration Economics (HOME) and Earth Hour.
Know of other non-profits we can help? Reach out to singapore@datakind.org or drop me a note =)
Kaggle Otto Challenge: How we achieved 85th out of 3,514 and what we learntEugene Yan Ziyou
Our team achieved 85th position out of 3,514 at the very popular Kaggle Otto Product Classification Challenge. Here's an overview of how we did it, as well as some techniques we learnt from fellow Kagglers during and after the competition.
Here's a summarised version of the slides shared by Nielsen at the DataScience SG meetup on 20 Apr 2015. Thanks to our generous speakers for sharing on their data science endeavours =D
Statistical inference: Statistical Power, ANOVA, and Post Hoc testsEugene Yan Ziyou
This deck was used in the IDA facilitation of the John Hopkins' Data Science Specialization course for Statistical Inference. It covers the topics in week 4 (statistical power, ANOVA, and post hoc tests).
The data and R script for the lab session can be found here: https://github.com/eugeneyan/Statistical-Inference
Statistical inference: Hypothesis Testing and t-tests, by Eugene Yan Ziyou
This deck was used in the IDA facilitation of the Johns Hopkins Data Science Specialization course for Statistical Inference. It covers the topics in week 3 (hypothesis testing and t-tests).
The data and R script for the lab session can be found here: https://github.com/eugeneyan/Statistical-Inference
Statistical inference: Probability and Distribution, by Eugene Yan Ziyou
This deck was used in the IDA facilitation of the Johns Hopkins Data Science Specialization course for Statistical Inference. It covers the topics in week 1 (probability) and week 2 (distribution).
A Study on the Relationship between Education and Income in the US, by Eugene Yan Ziyou
What is the relationship between education and income? Is education truly the great equalizer or do factors such as gender and family income at the age of 16 affect current income?
As part of the Coursera Data Analysis and Statistical Inference course, these questions were examined in R using data from the US General Social Survey.
Diving into Twitter data on consumer electronic brands, by Eugene Yan Ziyou
Which consumer electronic brands get tweeted about most? Which brands draw more positive or negative sentiment? To find out, 15.3 GB of tweets was downloaded from 13 to 25 May using Python and then analysed in R.
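As a toy illustration of the counting step in such an analysis (the actual work used R on 15.3 GB of tweets; the brand list and tweets below are made up for the sketch):

```python
from collections import Counter

BRANDS = {"apple", "samsung", "sony"}  # illustrative brand list


def brand_mentions(tweets):
    """Count how often each brand is mentioned across a list of tweets."""
    counts = Counter()
    for tweet in tweets:
        for word in tweet.lower().split():
            if word in BRANDS:
                counts[word] += 1
    return counts


tweets = ["Loving my new Samsung phone", "Apple vs Samsung again"]
print(brand_mentions(tweets).most_common(1))  # [('samsung', 2)]
```

A real pipeline would also normalise punctuation and handle hashtags and multi-word brand names, but the shape of the computation is the same.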
3. Lazada Data
Data App Devs: expose, integrate, platform-ize
Data Scientists: explore, prepare, model
Data Engineers: collect, store, maintain
Start from the bottom up
5. How much business input/overriding?
Trade-off: Manual human input vs. automated algorithms
Necessary to some extent, but harmful if overdone
Technically, manual input and rules are difficult to maintain
6. How much business input/overriding?
Example: Manual override of product ranking on the site
Allows category managers to incorporate their domain knowledge (e.g., new product releases, trending products)
Nonetheless, too much manual overriding reduced key metrics
Conducted A/B tests to find the optimal level of manual overriding
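One simple way to bound manual overrides is to blend them with the algorithmic score under a fixed weight. This is a minimal sketch, not Lazada's actual formula; the function name, the capping rule, and the 0.2 default are illustrative assumptions, and the boost weight is exactly the kind of knob an A/B test would tune.

```python
def blended_score(model_score, manual_boost, boost_weight=0.2):
    """Blend an algorithmic ranking score with a capped manual boost.

    boost_weight limits how far a category manager's override can move
    a product; setting it too high reproduces the "too much manual
    overriding" problem from the slide.
    """
    manual_boost = max(0.0, min(1.0, manual_boost))  # cap the override
    return (1 - boost_weight) * model_score + boost_weight * manual_boost


# A maximal manual boost nudges the score up but cannot dominate it:
# 0.8 * 0.5 + 0.2 * 1.0, i.e. roughly 0.6
score = blended_score(0.5, 1.0)
```

With this shape, an A/B test comparing several values of `boost_weight` directly measures how much business input helps before it starts to hurt.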
7. How fast is “too fast”?
Trade-off: Development speed vs. production stability
You can move faster without building tooling/abstractions, code reviews, automated testing, repaying technical debt, or documentation
But in the long run, these investments save time and effort
FB: “Move fast and break things” -> “Move fast with stable infra”
8. How fast is “too fast”?
[Chart: development effort over the long run, plotting effort against project size. Grinding out features ("Quick POC", "Dev, dev, dev", "Moar features!") takes less effort and is faster at first, but becomes more effort and slower as the project grows. Investing in automation, testing, tooling, and clearing tech debt costs more upfront, yet once the environment is in place it is less effort and faster, trading raw dev speed for production stability.]
9. How fast is “too fast”?
Example: 8-man team, 10 problems, mostly focused on delivery
In the first two years, the team achieved a lot and proved our worth
Nonetheless, as we matured and had to maintain more production code, investing in iteration speed and code quality had high ROI
10. How to set priorities with business?
Trade-off: Short-term vs. long-term
Business understands best what is needed, though it can be overly focused on day-to-day ops and near-term goals
Data science is aware of the latest research and can innovate, but risks being detached from business needs
11. How to set priorities with business?
Example: Timeboxed skunkworks resulting in POCs
Data leadership sponsored some POCs that were hacked together in 2–4 weeks; some eventually made it into production
Nonetheless, the focus is on research and innovation that can be applied to improve the online shopping experience
15. Overall results
Significant manpower cost savings (five figures monthly)
Existing workforce can be diverted to difficult-to-automate tasks
Reduced lead time before reviews go live on the site
19. [Architecture diagram: a Web Tracker (JavaScript), a Mobile Tracker (Adjust), and 3rd-party sources (e.g., ZenDesk, SurveyGizmo) on user devices feed Kafka queues; Spark bulk loaders land product, seller, and transaction data in Hadoop. Data exploration, data preparation, feature engineering, and modelling run on Spark, with category managers applying manual boosting via a Django app. Product rankings then go through local validation and A/B testing, splitting traffic and measuring outcomes.]
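The offline portion of that pipeline can be sketched as a chain of stage functions. This is a toy, pure-Python mock of the flow (the real stack runs on Spark over Hadoop); all function names, the event schema, and the click-count feature are illustrative assumptions.

```python
def prepare(events):
    """Data preparation: drop malformed events (e.g., missing product id)."""
    return [e for e in events if e.get("product_id")]


def featurize(events):
    """Feature engineering: count events per product as a toy feature."""
    counts = {}
    for e in events:
        counts[e["product_id"]] = counts.get(e["product_id"], 0) + 1
    return counts


def rank(features, manual_boosts=None):
    """Modelling + manual boosting: score, apply boosts, sort descending."""
    manual_boosts = manual_boosts or {}
    scored = {p: c + manual_boosts.get(p, 0) for p, c in features.items()}
    return sorted(scored, key=scored.get, reverse=True)


events = [{"product_id": "A"}, {"product_id": "B"},
          {"product_id": "A"}, {"product_id": None}]
ranking = rank(featurize(prepare(events)), manual_boosts={"B": 5})
print(ranking)  # ['B', 'A'] -- the manual boost lifts B above A
```

Each stage maps onto a box in the diagram; in production the same composition runs as Spark jobs, with the final ranking handed off to validation and A/B testing.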
20. Overall results
Better ranking improved conversion (3–8%) and revenue per session (5–20%)
Introducing new products improved new-product engagement (CTR increased 30–80%; add-to-cart increased 20–90%)
Emphasizing product quality had neutral-to-positive outcomes (reduced return rate; increased product net promoter score)
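For context on how such lift figures are read: a relative lift is the treatment metric's change over the control baseline. The numbers below are made-up illustrations, not Lazada's data.

```python
def relative_lift(control, treatment):
    """Relative lift of a treatment metric over control, in percent."""
    return (treatment - control) / control * 100.0


# A hypothetical A/B result: conversion rising from 2.50% to 2.60%
# is a 4% relative lift, within the 3-8% range reported on the slide.
lift = round(relative_lift(0.0250, 0.0260), 1)
```

Note that relative lift on a small base rate can look large while the absolute change stays modest, which is why both conversion and revenue per session are reported.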
21. Key takeaways
There is no single best answer to the challenges raised; it depends on the maturity stage of the team and organization
Data science > coding + machine learning: many other activities contribute greatly to the final impact