OpenHuFu

An open-source data federation system for secure and efficient query processing

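OpenHuFu is an open-source data federation system that aims to solve the data silo problem and enable secure, efficient query processing across data owners. The system uses secure multi-party computation techniques, including secret sharing, garbled circuits and oblivious transfer, and gives researchers a flexible platform for quickly implementing and evaluating federated query processing algorithms. OpenHuFu supports several query types, such as relational and spatial queries, and reports evaluation metrics such as communication cost and running time. The system provides an important tool for research on data federation and federated learning.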

OpenHuFu: An Open-Sourced Data Federation System

OpenHuFu is the open-sourced version of the spatial data federation system, Hu-Fu, in our VLDB'22 paper.

The appendix of the full paper is available in this PDF.

Project Description

Data isolation has become an obstacle to scaling up query processing over big data, since sharing raw data among data owners is often prohibited due to security concerns. A promising solution is to perform secure queries and analytics over a federation of multiple data owners, leveraging techniques like secure multi-party computation (SMC) and differential privacy, as evidenced by recent work on data federation and federated learning.

OpenHuFu is an open-sourced system for efficient and secure query processing on a data federation. It provides flexibility for researchers to quickly implement their algorithms for processing federated queries with SMC techniques, such as secret sharing, garbled circuits and oblivious transfer. With its help, researchers can quickly run experimental evaluations and measure the performance of their algorithms on benchmark datasets.
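To make the secret sharing technique mentioned above more concrete, the following is a minimal sketch of additive secret sharing applied to a federated sum. It is an illustration only and not OpenHuFu's own implementation: the function names `share` and `federated_sum`, the modulus, and the assumption of non-negative integer inputs are all choices made for this example.

```python
import secrets

# Assumed modulus for this sketch; it only needs to exceed the true sum.
MODULUS = 2**61 - 1


def share(value, n_parties, modulus=MODULUS):
    """Split value into n_parties additive shares that sum to value mod modulus."""
    shares = [secrets.randbelow(modulus) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % modulus
    return shares + [last]


def federated_sum(local_values, modulus=MODULUS):
    """Sum the owners' private values; no owner publishes its own value."""
    n = len(local_values)
    # Row i holds the shares created by owner i; column j is what owner j receives.
    share_matrix = [share(v, n, modulus) for v in local_values]
    # Each owner adds the shares it received and publishes only that partial sum.
    partial_sums = [sum(row[j] for row in share_matrix) % modulus for j in range(n)]
    # The partial sums combine to the true total (mod modulus).
    return sum(partial_sums) % modulus


if __name__ == "__main__":
    print(federated_sum([120, 45, 300]))  # prints 465
```

Each individual share is uniformly random, so a single share reveals nothing about the value it was derived from; only the combination of all partial sums yields the total.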

Compile OpenHuFu from Source Code

Prerequisites:

  • Linux or macOS
  • Java 11
  • Maven (version 3.5.2 or later)
  • C++ (to generate TPC-H data)
  • Python3 (to generate spatial data)
  • Git & Git LFS (Git Large File Storage)

Build OpenHuFu

Run the following commands:

  1. Clone OpenHuFu repository:
git clone https://github.com/BUAA-BDA/OpenHuFu.git
  2. Download big files from Git LFS (Large File Storage):
cd OpenHuFu
git lfs install --skip-smudge
git lfs pull
  3. Build:
cd OpenHuFu
bash scripts/build/package.sh

OpenHuFu is now installed in the release folder.

Notes

If you use macOS, you need to add this to settings.xml (the Maven settings file):

<profiles>
  <profile>
    <id>macos</id>
    <properties>
      <os.detected.classifier>osx-x86_64</os.detected.classifier>
    </properties>
  </profile>
</profiles>
<activeProfiles>
  <activeProfile>macos</activeProfile>
</activeProfiles>

Data Generation

Relational data: TPC-H

How to use it:

bash scripts/test/extract_tpc_h.sh
cd dataset/TPC-H\ V3.0.1/dbgen
cp makefile.suite makefile
# If you use macOS, you need to replace '#include <malloc.h>' with '#include <sys/malloc.h>' in dbgen
make
# Go to the root folder
cd ../../..
# x is the number of databases, y is the volume of each database (MB)
bash scripts/test/generateData.sh x y

Spatial data

Spatial sample data: dataset/newyork-taxi-sample.data

How to use it

Generate spatial data:

pip3 install numpy
python3 scripts/test/genSyntheticData.py databaseNum dataSize [distribution name] [params]

The supported distributions and their parameters are as follows:

| Distribution | param1 | param2 |
| uni | low (default = -1e7) | high (default = 1e7) |
| nor | mu (default = 0) | sigma (default = 1e5) |
| exp | mu (default = 5e6) | |

(If needed, you can modify scripts/test/genSyntheticData.py)
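As a rough illustration of what the three distributions in the table above produce, here is a minimal standard-library sketch. It is not the code of scripts/test/genSyntheticData.py: the function names are hypothetical, dataSize is interpreted as the number of points per database, each point is assumed to be a 2-D (x, y) pair, and mu for the exp distribution is assumed to be the mean.

```python
import random


def sample_coordinate(distribution="uni", params=None, rng=random):
    """Draw one coordinate, falling back to the defaults listed in the table."""
    params = list(params or [])
    if distribution == "uni":
        low = params[0] if len(params) > 0 else -1e7
        high = params[1] if len(params) > 1 else 1e7
        return rng.uniform(low, high)
    if distribution == "nor":
        mu = params[0] if len(params) > 0 else 0.0
        sigma = params[1] if len(params) > 1 else 1e5
        return rng.gauss(mu, sigma)
    if distribution == "exp":
        mu = params[0] if len(params) > 0 else 5e6
        # Assumption: mu is the mean, so the rate passed to expovariate is 1 / mu.
        return rng.expovariate(1.0 / mu)
    raise ValueError(f"unsupported distribution: {distribution}")


def generate_databases(database_num, data_size, distribution="uni", params=None, seed=0):
    """Return database_num lists, each holding data_size (x, y) points."""
    rng = random.Random(seed)
    return [
        [
            (sample_coordinate(distribution, params, rng),
             sample_coordinate(distribution, params, rng))
            for _ in range(data_size)
        ]
        for _ in range(database_num)
    ]


if __name__ == "__main__":
    for owner_id, points in enumerate(generate_databases(3, 2, "nor", [0, 1e5])):
        print(owner_id, points)
```

Fixing the seed keeps generated datasets reproducible across runs, which helps when comparing algorithms on the same synthetic input.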

Notes

Each table is defined by two files, one in CSV format and one in SCM format, and the shared file name serves as the table name. The CSV file contains the column names and the data of the table, while the SCM file contains the column names and column types. A delimiter separates the column fields, and it can be specified in the owner's configuration file.
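The two-file layout described above can be sketched as a small loader. This is an illustration under loud assumptions: the SCM layout used here (first line column names, second line column types), the type names `INT`, `DOUBLE` and `STRING`, the `|` delimiter, and the function `load_table` are all hypothetical and may not match the files OpenHuFu actually expects.

```python
import csv
import io

# Hypothetical type names mapped to Python converters; real SCM type names may differ.
CONVERTERS = {"INT": int, "DOUBLE": float, "STRING": str}


def load_table(csv_text, scm_text, delimiter="|"):
    """Parse a table from CSV text and SCM text that share one delimiter."""
    # Assumed SCM layout: line 1 = column names, line 2 = column types.
    scm_rows = list(csv.reader(io.StringIO(scm_text), delimiter=delimiter))
    names, types = scm_rows[0], scm_rows[1]
    # The CSV carries the column names in its first row, followed by the data rows.
    rows = list(csv.reader(io.StringIO(csv_text), delimiter=delimiter))
    header, data = rows[0], rows[1:]
    if header != names:
        raise ValueError("CSV header does not match SCM column names")
    casts = [CONVERTERS[t] for t in types]
    return [
        dict(zip(names, (cast(cell) for cast, cell in zip(casts, row))))
        for row in data
    ]


if __name__ == "__main__":
    scm = "id|name|x\nINT|STRING|DOUBLE\n"
    data = "id|name|x\n1|a|2.5\n2|b|3.0\n"
    print(load_table(data, scm))
```

Checking the CSV header against the SCM column names catches a mismatched pair of files early, before any query runs against the table.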

Configuration File

OwnerSide

UserSide

Development procedure

  1. Develop your algorithms
  • Aggregate:
class extends com.hufudb.openhufu.owner.implementor.aggregate.OwnerAggregateFunction
/**
 * The class must contain a constructor with parameters:
 * (OpenHuFuPlan.Expression agg, Rpc rpc, ExecutorService threadPool, OpenHuFuPlan.TaskInfo taskInfo)
 */
  • Join:
class implements com.hufudb.openhufu.owner.implementor.join.OwnerJoin
  2. Set the algorithm for the query (example in owner.yaml):
openhufu:
  implementor:
    aggregate:
      sum: com.hufudb.openhufu.owner.implementor.aggregate.sum.SecretSharingSum
      max: null
      min: null
    join: com.hufudb.openhufu.owner.implementor.join.HashJoin
  3. Build OpenHuFu

    Follow the instructions in Section Build OpenHuFu to build the project.

  4. Run OpenHuFu

    We provide sample configurations for 3 owners in the release/config folder. You can use these configurations to run our demo on a single machine, or modify the configuration files to deploy OpenHuFu on multiple machines.

    Please note that since the configuration files use relative paths, you need to cd release before running the commands.

    Run demo on a single machine:

    bash owner_all.sh

    Run OpenHuFu on multiple machines:

    bash owner.sh start ./config/owner{i}.json

    Stop OpenHuFu:

    bash owner.sh stop

  5. Run benchmarks

    bash benchmark.sh

  6. Evaluate communication cost

    Before running benchmarks on OpenHuFu, you can follow these instructions to evaluate the communication cost of a query:

  • Monitoring the port
# run the shell script as root
# 8888 is the port number
sudo bash scripts/test/network_mmonitor/start.sh 8888
  • Calculating the communication cost
# run the shell script as root
sudo bash scripts/test/network_mmonitor/monitor.sh

Data Query Language

  1. Plan
  2. Function Call

Supported Query Types

  • Filter
  • Projection
  • Join
    • equi join
    • theta join
  • Cross products
  • Aggregate (incl. group-by)
  • Limited window aggs
  • Distinct
  • Sort
  • Limit
  • Common table expressions
  • Spatial Queries:
    • range query
    • range counting
    • knn query
    • distance join
    • knn join
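For readers unfamiliar with the spatial query types listed above, the following sketch shows their plaintext semantics on in-memory 2-D points. It involves no secure computation and is not OpenHuFu's API; it could serve as a non-secure reference for checking results on small inputs. The function names are hypothetical, and the use of Euclidean distance with a circular range (center plus radius) is an assumption of this example.

```python
import heapq
import math


def range_query(points, center, radius):
    """Return the points within radius of center."""
    return [p for p in points if math.dist(p, center) <= radius]


def range_count(points, center, radius):
    """Return how many points lie within radius of center."""
    return len(range_query(points, center, radius))


def knn_query(points, center, k):
    """Return the k points nearest to center, closest first."""
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, center))


def distance_join(left, right, radius):
    """Return all (l, r) pairs whose distance is at most radius."""
    return [(l, r) for l in left for r in right if math.dist(l, r) <= radius]


def knn_join(left, right, k):
    """Map each point in left to its k nearest neighbours in right."""
    return {l: knn_query(right, l, k) for l in left}


if __name__ == "__main__":
    # Three owners' local points; a plaintext reference simply unions them.
    owners = [[(0, 0), (1, 1)], [(5, 5)], [(2, 0)]]
    all_points = [p for owner in owners for p in owner]
    print(range_count(all_points, (0, 0), 2))  # prints 3
    print(knn_query(all_points, (0, 0), 2))    # prints [(0, 0), (1, 1)]
```

A federated implementation is expected to return the same answers as this union-based reference while keeping each owner's raw points private.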

Evaluation Metrics

  • Communication Cost
  • Running Time
    • Total Query Time
    • Local Query Time
    • Encryption Time
    • Decryption Time

Related Work

If you find OpenHuFu helpful in your research, please consider citing our papers. The papers and their bibtex entries are listed below:

  1. Hu-Fu: Efficient and Secure Spatial Queries over Data Federation. Yongxin Tong, Xuchen Pan, Yuxiang Zeng, Yexuan Shi, Chunbo Xue, Zimu Zhou, Xiaofei Zhang, Lei Chen, Yi Xu, Ke Xu, Weifeng Lv. Proc. VLDB Endow. 15(6): 1159-1172 (2022). [paper] [slides] [bibtex]

Other helpful related work from our group is listed below:

  1. DM-PFL: Hitchhiking Generic Federated Learning for Efficient Shift-Robust Personalization. Wenhao Zhang, Zimu Zhou, Yansheng Wang, Yongxin Tong. KDD 2023. [paper] [bibtex]

  2. Efficient Approximate Range Aggregation Over Large-Scale Spatial Data Federation. Yexuan Shi, Yongxin Tong, Yuxiang Zeng, Zimu Zhou, Bolin Ding, Lei Chen. IEEE Trans. Knowl. Data Eng. 35(1): 418-430 (2023). [paper] [bibtex]

  3. Hu-Fu: A Data Federation System for Secure Spatial Queries. Xuchen Pan, Yongxin Tong, Chunbo Xue, Zimu Zhou, Junping Du, Yuxiang Zeng, Yexuan Shi, Xiaofei Zhang, Lei Chen, Yi Xu, Ke Xu, Weifeng Lv. Proc. VLDB Endow. 15(12): 3582-3585 (2022). [paper] [bibtex]

  4. Data Source Selection in Federated Learning: A Submodular Optimization Approach. Ruisheng Zhang, Yansheng Wang, Zimu Zhou, Ziyao Ren, Yongxin Tong, Ke Xu. DASFAA 2022. [paper] [bibtex]

  5. Fed-LTD: Towards Cross-Platform Ride Hailing via Federated Learning to Dispatch. Yansheng Wang, Yongxin Tong, Zimu Zhou, Ziyao Ren, Yi Xu, Guobin Wu, Weifeng Lv. KDD 2022. [paper] [bibtex]

  6. Efficient and Secure Skyline Queries over Vertical Data Federation. Yuanyuan Zhang, Yexuan Shi, Zimu Zhou, Chunbo Xue, Yi Xu, Ke Xu, Junping Du. IEEE Trans. Knowl. Data Eng. (2022). [paper] [bibtex]

  7. Federated Topic Discovery: A Semantic Consistent Approach. Yexuan Shi, Yongxin Tong, Zhiyang Su, Di Jiang, Zimu Zhou, Wenbin Zhang. IEEE Intell. Syst. 36(5): 96-103 (2021). [paper] [bibtex]

  8. Industrial Federated Topic Modeling. Di Jiang, Yongxin Tong, Yuanfeng Song, Xueyang Wu, Weiwei Zhao, Jinhua Peng, Rongzhong Lian, Qian Xu, Qiang Yang. ACM Trans. Intell. Syst. Technol. 12(1): 2:1-2:22 (2021). [paper] [bibtex]

  9. A GDPR-compliant Ecosystem for Speech Recognition with Transfer, Federated, and Evolutionary Learning. Di Jiang, Conghui Tan, Jinhua Peng, Chaotao Chen, Xueyang Wu, Weiwei Zhao, Yuanfeng Song, Yongxin Tong, Chang Liu, Qian Xu, Qiang Yang, Li Deng. ACM Trans. Intell. Syst. Technol. 12(3): 30:1-30:19 (2021). [paper] [bibtex]

  10. An Efficient Approach for Cross-Silo Federated Learning to Rank. Yansheng Wang, Yongxin Tong, Dingyuan Shi, Ke Xu. ICDE 2021. [paper] [slides] [bibtex]

  11. Federated Learning in the Lens of Crowdsourcing. Yongxin Tong, Yansheng Wang, Dingyuan Shi. IEEE Data Eng. Bull. 43(3): 26-36 (2020). [paper] [bibtex]

  12. Federated Latent Dirichlet Allocation: A Local Differential Privacy Based Framework. Yansheng Wang, Yongxin Tong, Dingyuan Shi. AAAI 2020. [paper] [bibtex]

  13. Federated Acoustic Model Optimization for Automatic Speech Recognition. Conghui Tan, Di Jiang, Huaxiao Mo, Jinhua Peng, Yongxin Tong, Weiwei Zhao, Chaotao Chen, Rongzhong Lian, Yuanfeng Song, Qian Xu. DASFAA 2020. [paper] [bibtex]

  14. Efficient and Fair Data Valuation for Horizontal Federated Learning. Shuyue Wei, Yongxin Tong, Zimu Zhou, Tianshu Song. Federated Learning 2020. [paper] [bibtex]

  15. Profit Allocation for Federated Learning. Tianshu Song, Yongxin Tong, Shuyue Wei. IEEE BigData 2019. [paper] [slides] [bibtex]

  16. Federated Machine Learning: Concept and Applications. Qiang Yang, Yang Liu, Tianjian Chen, Yongxin Tong. ACM Trans. Intell. Syst. Technol. 10(2): 12:1-12:19 (2019). [paper]
