Enhancing AI-Driven User Engagement with Real-Time Data Streaming via Flink.pptx
VirtusLab
About This Presentation
A case study of how we enhanced an AI-driven user engagement with real-time data streaming via Flink
Size: 12.34 MB
Language: en
Added: Jun 17, 2024
Slides: 19 pages
Slide Content
Enhancing AI-Driven User Engagement with Real-Time Data Streaming via Flink Zbigniew Królikowski Senior ML Engineer @VirtusLab
Our client, a worldwide retailer, runs an e-commerce platform. Content displayed to the customer is personalised based on individual purchase history. The existing approach relied on batch data processing and model training, which introduced a significant delay between customer actions on the website and tailored content recommendations, leading to missed sales opportunities. The client's platform allows for continuous tracking and processing of customer actions with minimal latency; up to this point, this capability wasn't leveraged. Our Machine Learning engineering team undertook the challenge of building an online data processing and model training solution that improves customer engagement in order to generate more sales opportunities for the client. The opportunity
From a Machine Learning perspective, our main goal was to improve the metric called Click-Through Rate (CTR), which is built on a statistic of individual users seeing a particular piece of content and engaging with it. Engagement with content leads to sales opportunities, impacting Revenue per Customer (RPC) as well as Revenue per Basket (RPB). Both RPC and RPB have a direct link to profit and were actively monitored during A/B testing through analytics tooling; however, CTR is more practical as a running metric, as it can be calculated on the fly. Business metrics we were interested in
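For reference, CTR can be expressed as a simple ratio (the exact attribution rules and aggregation windows used in the project are not shown on this slide):

$$\mathrm{CTR} = \frac{\text{clicks on a piece of content}}{\text{impressions of that piece of content}}$$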
Events describing user actions are fed through Kafka. Our solution supports backwards compatibility across different event versions. Events contain information about user identity, displayed content, the user's basket, as well as the user's interactions within the page. A number of them need to be taken into account to form a complete customer story. All event types have to be fed through specialised filtering logic as well as correctly joined and aggregated to build the model training features. Events are broadcast from all locations in the mobile app and website, while select subsets of this data have to be efficiently routed into dedicated machine learning models. The data
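As a rough illustration of this ingestion step, the sketch below consumes events from Kafka with Flink's DataStream API, filters the event types a given model cares about, and keys the stream by user. The broker address, topic name, event-type strings and the extractUserId helper are illustrative assumptions, not the client's actual setup.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class EventIngestionSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical topic holding versioned user-action events as JSON strings.
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")
                .setTopics("user-events")
                .setGroupId("engagement-features")
                .setStartingOffsets(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        DataStream<String> rawEvents =
                env.fromSource(source, WatermarkStrategy.noWatermarks(), "user-events");

        // Illustrative filtering: keep only the event types one model cares about,
        // then key by user so downstream joins/aggregations see one customer's story together.
        rawEvents
                .filter(json -> json.contains("\"type\":\"content_view\"")
                        || json.contains("\"type\":\"content_click\""))
                .keyBy(json -> extractUserId(json))
                .print();

        env.execute("event-ingestion-sketch");
    }

    // Placeholder: a real job would parse the JSON and read a user/session identifier.
    private static String extractUserId(String json) {
        return Integer.toString(Math.floorMod(json.hashCode(), 1024));
    }
}
```

In the real solution, deserialization would target versioned event schemas rather than raw JSON strings, which is where the backwards-compatibility handling would live.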
The client offers a mobile application as well as a web service. Our solution is capable of supporting both of these, as well as potential extensions for on-site use cases. Training a single model, on one platform, requires processing events for each individual customer exposed to that piece of content; this is, on average, close to 2000 events per second. There are multiple locations, on each platform, that can be enhanced with the move to the online solution enabled by Flink. In order to provide complete coverage, the solution needs to be able to scale up to tens of thousands of events per second, while keeping costs manageable during development ramp-up. The data: volume
Uptime SLAs are fulfilled and the system needs to be able to recover from transient failures automatically. Data is never lost and the whole process is always recoverable. User identity remains hidden at all times. No bias is introduced to the models through the data, ensuring fair treatment. Training is based solely on relevant characteristics. Personalisation is only delivered to logged-in users who have expressly agreed to participate. Other requirements
Apache Flink is stream processing software designed from the ground up with these tenets in mind: Correctness guarantees - reproducible and consistent results. Layered APIs - viable for all team skill sets. Operational focus - production ready from the start. Scalability - from minimal use cases to core business. Performance - a no-compromise approach to low latency, high throughput and minimal operational costs. Flink offers excellent integration with a number of queueing systems, including Kafka, as well as SQL and NoSQL databases. Apache Flink is open source software available under the Apache License 2.0. Apache Flink
In contrast to stateless processing, stateful processing enables a broader set of operations that encompass more than one event. This opens up the possibility for more sophisticated business processes to be represented within the code. Apache Flink supports multiple back-ends for storing state to best fit different scalability, throughput, and latency requirements: in-memory (on the JVM heap) or an embedded key-value store (RocksDB). Stateful processing
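A minimal sketch of what keyed state looks like in code, assuming a hypothetical per-user event counter applied after a keyBy; the real feature-building state is richer than a single number.

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Counts how many events have been seen per user key; the count survives restarts
// because it lives in Flink managed state (heap or RocksDB, depending on the backend).
public class PerUserEventCounter extends KeyedProcessFunction<String, String, String> {

    private transient ValueState<Long> eventsSeen;

    @Override
    public void open(Configuration parameters) {
        eventsSeen = getRuntimeContext().getState(
                new ValueStateDescriptor<>("events-seen", Long.class));
    }

    @Override
    public void processElement(String event, Context ctx, Collector<String> out) throws Exception {
        long seen = eventsSeen.value() == null ? 0L : eventsSeen.value();
        eventsSeen.update(seen + 1);
        out.collect(ctx.getCurrentKey() + " has produced " + (seen + 1) + " events");
    }
}
```

Switching between the heap-based backend (HashMapStateBackend) and the RocksDB backend (EmbeddedRocksDBStateBackend) is a deployment-time choice and does not require changing the function above.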
Apache Flink supports both batch and stream processing and guarantees full reproducibility of results across both approaches. This requires close to zero code changes and works with each of Flink's APIs. Both bounded and unbounded streams are supported. This feature makes Flink capable of covering the same use cases as batch processing tools like Apache Spark, which has the potential to reduce the amount of different tooling necessary for a project. Batch processing and Stream processing Image source: https://flink.apache.org/what-is-flink/flink-architecture/
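A small sketch of how the same DataStream program can be executed in either mode; the bounded example source below stands in for a real bounded input such as a Kafka topic read up to fixed offsets.

```java
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ExecutionModeSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // The same pipeline definition runs as a batch job or as a streaming job;
        // BATCH mode requires all sources to be bounded (files, or Kafka up to fixed offsets).
        env.setRuntimeMode(RuntimeExecutionMode.BATCH); // or RuntimeExecutionMode.STREAMING

        env.fromSequence(1, 100)              // bounded example source
           .filter(n -> n % 2 == 0)           // identical logic in both modes
           .print();

        env.execute("execution-mode-sketch");
    }
}
```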
A big part of what makes Flink a good fit for the client is its very powerful event-time processing feature. By using this feature, the risk associated with a scalable, dynamic environment is mitigated and we achieve full reproducibility. As logical time is untied from wall-clock time, the application gains data consistency, and the focus can be directed towards keeping latency as low as possible. The development process is also greatly enhanced, as the logic can be rolled back to any point in time, allowing for a replicable development environment. Moreover, the same applies to the root-cause analysis process: production issues can be resolved swiftly and reliably, reducing ongoing maintenance costs. Event-time processing
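A minimal sketch of event-time assignment, assuming a hypothetical UserEvent type that carries its own timestamp; the five-second out-of-orderness bound is an illustrative value.

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class EventTimeSketch {

    /** Hypothetical event shape; the real event schema is considerably richer. */
    public static class UserEvent {
        public String userId;
        public long timestampMillis; // when the action happened on the device
        public UserEvent() {}
        public UserEvent(String userId, long timestampMillis) {
            this.userId = userId;
            this.timestampMillis = timestampMillis;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<UserEvent> events = env.fromElements(
                new UserEvent("u1", 1_000L),
                new UserEvent("u1", 4_000L),
                new UserEvent("u2", 2_500L));

        // Logical (event) time is taken from the event itself, not from the wall clock,
        // so results are the same whether the stream is processed live or replayed later.
        events.assignTimestampsAndWatermarks(
                        WatermarkStrategy.<UserEvent>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                                .withTimestampAssigner((event, recordTs) -> event.timestampMillis))
              .keyBy(e -> e.userId)
              .print();

        env.execute("event-time-sketch");
    }
}
```

Downstream event-time windows and joins then operate on these timestamps, which is why replaying the same input yields the same results.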
Flink supports two application programming interfaces (APIs) with a large overlap of features: Table API (SQL) - suitable for BI specialists, analysts and data scientists. DataStream API (available for Python, Java and Scala) - suitable for data engineers and ML engineers. This makes Flink fit seamlessly into the existing proficiencies of the team, without the time investment necessary for re-skilling. Available languages and APIs Image source: https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/concepts/overview/
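A small illustration of the SQL flavour; the table definition uses the built-in datagen connector purely as a stand-in for the real Kafka-backed event tables, and the column names are illustrative.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class SqlApiSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // A generated in-memory table standing in for the real Kafka-backed event table.
        tEnv.executeSql(
                "CREATE TEMPORARY TABLE content_events (" +
                "  user_id STRING," +
                "  content_id STRING," +
                "  action STRING" +
                ") WITH (" +
                "  'connector' = 'datagen'," +
                "  'rows-per-second' = '10'," +
                "  'number-of-rows' = '100'" +
                ")");

        // Analysts can express aggregations in plain SQL, without writing Java or Scala.
        tEnv.executeSql(
                "SELECT content_id, COUNT(*) AS events_seen " +
                "FROM content_events " +
                "GROUP BY content_id").print();
    }
}
```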
System architecture
Flink can be easily integrated with cloud platforms as well as on-premise environments through the use of the Flink Kubernetes Operator, which handles many aspects of the deployment. Deployment Image source: https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/img/overview.svg
Flink offers complete consistency of results through checkpointing, which can be run on a very short schedule (seconds to minutes) or continuously. In case of issues, the Flink application is able to reload from the latest checkpoint and seamlessly catch up with minimal disruption. When deployed on Kubernetes, Flink supports a high-availability (HA) deployment method in which application stability is maintained in the case of failure of the JobManager, the entity that supervises scheduling and resource management. The industry-standard ZooKeeper can be used instead of the Kubernetes-based HA services, which makes HA possible on non-Kubernetes platforms. Resilience Image source: https://nightlies.apache.org/flink/flink-docs-master/docs/concepts/stateful-stream-processing/
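Enabling checkpointing is a small amount of code in the job itself; the interval, timeout and storage path below are illustrative values, not the client's actual configuration.

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointingSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Snapshot all operator state every 30 s with exactly-once semantics; on failure
        // the job restarts from the latest completed checkpoint and catches up.
        env.enableCheckpointing(30_000, CheckpointingMode.EXACTLY_ONCE);
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(10_000);
        env.getCheckpointConfig().setCheckpointTimeout(120_000);
        // Illustrative durable location; in production this would be an object store or DFS.
        env.getCheckpointConfig().setCheckpointStorage("file:///tmp/flink-checkpoints");

        env.fromSequence(1, 1_000)   // placeholder pipeline so the job has something to run
           .print();

        env.execute("checkpointing-sketch");
    }
}
```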
The deployment adapts to changing load patterns by increasing or reducing the allocation of resources based on the load. This lowers operating costs, while at the same time safeguarding against deterioration of the service level during periods of increased customer traffic. Apache Flink will automatically readjust its internal allocation of resources to individual processing steps based on observed processing characteristics. Through this approach, administrative and development costs can be reduced while ensuring operational efficiency. Auto-scaling Image source: https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-release-1.7/docs/custom-resource/autoscaler
Flink offers excellent built-in monitoring capability with a web UI: Provides real-time statistics of input and output data sizes and message counts for every step of the processing. Collects all application logs in a single location for seamless browsing and auditing. Displays information regarding checkpoints, providing a clear understanding of the stability of the application. The same capability is available during the development process, ensuring quality and clarity of operation from the early stages. Additionally, we've integrated the Flink instance with external cloud monitoring tools, giving us a uniform monitoring capability across all cloud deployments. Monitoring
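Jobs can also expose their own metrics next to Flink's built-in ones; below is a sketch of a filter that counts dropped events, with an illustrative metric name and predicate. The counter appears in the web UI and can be shipped to external monitoring through a configured metrics reporter.

```java
import org.apache.flink.api.common.functions.RichFilterFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Counter;

// A filter that also counts how many events it drops; the counter shows up in the
// Flink web UI alongside the built-in throughput and checkpoint metrics.
public class CountingFilter extends RichFilterFunction<String> {

    private transient Counter dropped;

    @Override
    public void open(Configuration parameters) {
        dropped = getRuntimeContext().getMetricGroup().counter("droppedEvents");
    }

    @Override
    public boolean filter(String event) {
        boolean keep = event.contains("content"); // illustrative predicate
        if (!keep) {
            dropped.inc();
        }
        return keep;
    }
}
```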
System architecture - adjusted
Our solution gave a 35% uptick in CTR for the duration of the A/B trial. Performance has been maintained over time, which shows that this method has prevented concept/model drift. The results