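Web Page Understanding
Turn websites back into data with zero intervention
Given an entry link, Platon AI (柏拉图 AI) identifies, browses, and interprets the most important outgoing pages, then outputs every field: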
select * from harvest('https://www.amazon.com/b?node=3117954011');
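The AI browsed 120 web pages and recognized 8 data groups with 142 fields in total. Group 2 is shown below; it contains 10 fields and corresponds to the page region #centerCol.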
C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 | C10 | |||
1 | Amazon.com: BLACK+DECKER 6 quart 11-in-1 Cooking Pot, Stainless Steel, Pressure Cooker, Slow Cooker, Multi-Cooker, PR100 | BLACK+DECKER 6 quart 11-in-1 Cooking Pot, Stainless Steel, Pressure Cooker, Slow Cooker, Multi-Cooker, PR100 | by | BLACK+DECKER | 4.2 out of 5 stars | 129 ratings | | | 89 answered questions | + No Import Fees Deposit & ¥40.72 Shipping to Hong Kong | New (5) from | ¥54.17 | |
2 | Amazon.com: BLACK+DECKER 6 quart 11-in-1 Cooking Pot, Stainless Steel, Pressure Cooker, Slow Cooker, Multi-Cooker, PR100 | BLACK+DECKER 6 quart 11-in-1 Cooking Pot, Stainless Steel, Pressure Cooker, Slow Cooker, Multi-Cooker, PR100 | by | BLACK+DECKER | 4.2 out of 5 stars | 129 ratings | | | 89 answered questions | + No Import Fees Deposit & ¥40.72 Shipping to Hong Kong | New (5) from | ¥54.17 | |
3 | Amazon.com: Crock Pot 6 Quart 8 in 1 Multi Use Express Crock Programmable Pressure Cooker, Slow Cooker, Sauté & Steamer | Stainless Steel (SCCPPC60... | Crock Pot 6 Quart 8 in 1 Multi Use Express Crock Programmable Pressure Cooker, Slow Cooker, Sauté & Steamer | Stainless Steel (SCCPPC600 V1) | by | Crockpot | 4.2 out of 5 stars | 2,086 ratings | | | 670 answered questions | There is a newer model of this item: | New (31) from | ¥74.79 | |
4 | Amazon.com: Crockpot Thermoshield 6 Quart Manual Slow Cooker, Black | Crockpot Thermoshield 6 Quart Manual Slow Cooker, Black | by | Crockpot | 4.1 out of 5 stars | 150 ratings | | | 47 answered questions | + No Import Fees Deposit & ¥47.40 Shipping to Hong Kong | New & Used (12) from | ¥59.99 | |
5 | Amazon.com: GoWISE USA GW22637 4th-Generation Electric Pressure Cooker with rice scooper, and measuring cup, 14 QT | GoWISE USA GW22637 4th-Generation Electric Pressure Cooker with rice scooper, and measuring cup, 14 QT | by | GoWISE USA | 3.9 out of 5 stars | 927 ratings | | | 498 answered questions | + No Import Fees Deposit & ¥70.96 Shipping to Hong Kong | New & Used (4) from | ¥113.18 | |
6 | Amazon.com: GoWISE USA GW22637 4th-Generation Electric Pressure Cooker with rice scooper, and measuring cup, 14 QT | GoWISE USA GW22637 4th-Generation Electric Pressure Cooker with rice scooper, and measuring cup, 14 QT | by | GoWISE USA | 3.9 out of 5 stars | 927 ratings | | | 498 answered questions | + No Import Fees Deposit & ¥70.96 Shipping to Hong Kong | New & Used (4) from | ¥113.18 | |
7 | Amazon.com: GoWISE USA GW22637 4th-Generation Electric Pressure Cooker with rice scooper, and measuring cup, 14 QT | GoWISE USA GW22637 4th-Generation Electric Pressure Cooker with rice scooper, and measuring cup, 14 QT | by | GoWISE USA | 3.9 out of 5 stars | 927 ratings | | | 498 answered questions | + No Import Fees Deposit & ¥70.96 Shipping to Hong Kong | New & Used (4) from | ¥113.18 | |
8 | Amazon.com: Gourmia GPC400 4 Qt Digital Multi-Mode SmartPot Pressure Cooker - 13 Cook Modes - Removable Pot - 24-Hour Delay Timer - Automatic Keep ... | Gourmia GPC400 4 Qt Digital Multi-Mode SmartPot Pressure Cooker - 13 Cook Modes - Removable Pot - 24-Hour Delay Timer - Automatic Keep Warm - LCD Display - Pressure Sensor Lid Lock - Recipe Book | by | Gourmia | 4.2 out of 5 stars | 363 ratings | | | 171 answered questions | + No Import Fees Deposit & ¥31.80 Shipping to Hong Kong | |||
9 | Amazon.com: Mealthy MultiPot 9-in-1 Programmable Pressure Cooker 6 Quarts with Stainless Steel Pot, Steamer Basket, instant access to recipe app. P... | Mealthy MultiPot 9-in-1 Programmable Pressure Cooker 6 Quarts with Stainless Steel Pot, Steamer Basket, instant access to recipe app. Pressure cook, slow cook, sauté, rice cooker, yogurt, steam | by | Mealthy | 4.7 out of 5 stars | 1,593 ratings | | | 934 answered questions | New & Used (3) from | ¥169.99 | ||
10 | Amazon.com: Ninja Instant, 1000-Watt Pressure, Slow, Multi Cooker, and Steamer with 6-Quart Ceramic Coated Pot & Steam Rack (PC101), Si, Black/Silver | Ninja Instant, 1000-Watt Pressure, Slow, Multi Cooker, and Steamer with 6-Quart Ceramic Coated Pot & Steam Rack (PC101), Si, Black/Silver | by | Ninja | 4.7 out of 5 stars | 120 ratings | | | 65 answered questions | This product is available as Renewed. | New & Used (11) from | ¥54.95 | |
11 | Amazon.com: Power Pressure Cooker XL 10 Qt | Power Pressure Cooker XL 10 Qt | by | Power Pressure Cooker XL | 4.1 out of 5 stars | 2,977 ratings | | | 1000+ answered questions | + No Import Fees Deposit & ¥51.68 Shipping to Hong Kong | New & Used (6) from | ¥159.00 | |
12 | Amazon.com: Presto 02141 6-Quart Electric Pressure Cooker, Stainless, Black, Silver | Presto 02141 6-Quart Electric Pressure Cooker, Stainless, Black, Silver | by | Presto | 4.2 out of 5 stars | 54 ratings | | | 17 answered questions | + No Import Fees Deposit & ¥38.45 Shipping to Hong Kong | New & Used (33) from | ¥59.99 |
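The rating, count, and price columns above come back as raw strings. As a minimal sketch of how a client might turn them into numbers, the Python helpers below pull the first number out of a string. The names `first_float` and `first_int` are hypothetical and are not part of Platon; the sample values are copied from row 3 of the table.

```python
import re
from typing import Optional


def first_float(text: str, default: float = 0.0) -> float:
    """Return the first decimal number in text, or default if there is none."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return default
    # Drop thousands separators before converting.
    return float(match.group(0).replace(",", ""))


def first_int(text: str) -> Optional[int]:
    """Return the first integer in text, ignoring thousands separators."""
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        return None
    return int(match.group(0).replace(",", ""))


# Values copied from row 3 of the table above.
row = {
    "rating": "4.2 out of 5 stars",
    "ratings": "2,086 ratings",
    "price": "¥74.79",
}
print(first_float(row["rating"]), first_int(row["ratings"]), first_float(row["price"]))
```

Cells such as "1000+ answered questions" lose the "+" under this approach, so a real pipeline may want to keep the raw string alongside the parsed value.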
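Product Overview
An Army of AI Agents
- Artificial intelligence - AI-driven web mining that restores web pages to data completely and accurately at very large scale, with zero or minimal intervention
- Elastic computing - A distributed web page rendering engine that meets data collection needs at any scale
- Business intelligence - Apply business intelligence on the Web, capture thousands of high-value events, and answer business-critical questions
- X-SQL - A SQL engine built on top of the Web that treats the Web and local databases alike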
-- Turn a set of Amazon product pages into a local table
select
dom_base_uri(dom) as `url`,
dom_first_text(dom, '#productTitle') as `title`,
str_substring_after(dom_first_href(dom, '#wayfinding-breadcrumbs_container ul li:last-child a'), '&node=') as `category`,
dom_first_slim_html(dom, '#bylineInfo') as `brand`,
cast(dom_all_slim_htmls(dom, '#imageBlock img') as varchar) as `gallery`,
dom_first_slim_html(dom, '#landingImage, #imgTagWrapperId img, #imageBlock img:expr(width > 400)') as `img`,
dom_first_text(dom, '#price tr td:contains(List Price) ~ td') as `listprice`,
dom_first_text(dom, '#price tr td:matches(^Price) ~ td') as `price`,
str_first_float(dom_first_text(dom, '#reviewsMedley .AverageCustomerReviews span:contains(out of)'), 0.0) as `score`
from load_out_pages('https://www.amazon.com/b?node=3117954011', 'a[href~=/dp/]', 1, 10);
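Key Challenges
Web data processing poses the following key algorithmic challenges:
- Automatic web extraction - Structure Internet-scale web pages automatically, with no human intervention
- AI-assisted web extraction - Structure web pages completely and accurately at large scale, with zero or minimal intervention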
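A well-behaved, enterprise-grade web data processing system must also solve the following engineering problems:
- Augmented analytics - Provide AI techniques such as machine learning and knowledge graphs to enhance data analysis
- Machine learning - Support machine learning algorithms to lower the barrier to data processing and improve efficiency
- Cloud services - Offer cloud services to lower the barrier to adoption and speed up delivery
- Data manipulation language - Support a data manipulation language to simplify remote data operations
- Quality assurance - Assure system quality, including data quality and scheduling quality under large-scale collection
- Performance optimization - Parallelize collection units to make full use of hardware resources, and modify the browser kernel to improve performance
- Elastic computing - Support elastic computing for seamless scaling, and with it the capacity to process Internet-scale data
- Robustness - Cope with complex network environments through complete, strict exception handling and retry mechanisms
- Storage - Provide a complete toolchain for storing large-scale web data
- Operations tooling - Provide complete metrics and logs, plus operations tools to observe system status in real time and to diagnose and maintain the system
- End-to-end workflow - Cover the whole process, from collecting raw data such as web pages to drawing conclusions and generating reports
- Other concerns - Acquisition cost, skill requirements, data scale, data fusion, timeliness, maintainability, and so on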
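The robustness item in the list above mentions retry mechanisms. As a rough illustration of that idea only, and not Platon's actual implementation, the Python sketch below retries a failing task with exponentially growing delays. The function name `with_retries` and its parameters are assumptions made for this example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    task: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run task, retrying on any exception with exponentially growing delays."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return task()
        except Exception:
            # Out of attempts: let the last error propagate to the caller.
            if attempt == attempts - 1:
                raise
            # Wait base_delay, then 2x, then 4x, and so on.
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting. A production system would likely also distinguish retryable errors (timeouts, connection resets) from permanent ones.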
SaaS
curl -X POST --location "http://platonic.fun:8182/api/x/e" -H "Content-Type: text/plain" -d "
select
dom_base_uri(dom) as url,
dom_first_text(dom, '#productTitle') as title,
str_substring_after(dom_first_href(dom, '#wayfinding-breadcrumbs_container ul li:last-child a'), '&node=') as category,
dom_first_slim_html(dom, '#bylineInfo') as brand,
cast(dom_all_slim_htmls(dom, '#imageBlock img') as varchar) as gallery,
dom_first_slim_html(dom, '#landingImage, #imgTagWrapperId img, #imageBlock img:expr(width > 400)') as img,
dom_first_text(dom, '#price tr td:contains(List Price) ~ td') as listprice,
dom_first_text(dom, '#price tr td:matches(^Price) ~ td') as price,
str_first_float(dom_first_text(dom, '#reviewsMedley .AverageCustomerReviews span:contains(out of)'), 0.0) as score
from load_and_select('https://www.amazon.com/dp/B07XJ8C8F7 -i 20s', 'body');"
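A REST API with X-SQL support
- Business model mapping - Use X-SQL to transform web data into your local business model
- Data API - Platon's elastic computing puts Web data at scale within easy reach
- Advanced SaaS - X-SQL's flexible built-in functions provide further data processing, such as sentiment detection and knowledge graph construction
- Domain SaaS - For common domains, Platon ships out-of-the-box solutions
Cost savings: Compared with traditional approaches, using Platon to manage external data has cut our customers' staffing costs and hardware investment by at least half each.
Data scale: With Platon's machine learning technology we can now obtain nearly every field of a website, with no extraction rules left to maintain.
Delivery time: Platon makes it simple to apply business intelligence on the World Wide Web. Compared with the traditional pipeline of defining collection rules, collecting and storing data, cleaning it, and building BI reports, delivery timeliness improves by more than 90%.
Data quality: Traditional manual extraction yields roughly 50% of the fields from a very small number of websites. With Platon's data mining technology, more than 95% of the data can be obtained from websites of any scale.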
Solutions
Tell us what kind of project you are working on
Batch discount calculation for Best Buy
select
dom_first_number(dom, '.priceView-customer-price') as `price`,
dom_first_number(dom, '.pricing-price__regular-price') as `list-price`,
dom_first_number(dom, '.pricing-price__regular-price') - dom_first_number(dom, '.priceView-customer-price') as `saving`
from
load_out_pages('https://www.bestbuy.com/site/promo/laptop-and-computer-deals', 'h4.sku-header a')
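The `saving` column in the query above is simply the regular price minus the customer price. As a small worked example of the same arithmetic on the client side, the Python sketch below also derives a percentage discount and guards against missing prices. The function name `saving` and the sample prices are made up for illustration.

```python
from typing import Optional, Tuple


def saving(list_price: Optional[float], price: Optional[float]) -> Optional[Tuple[float, float]]:
    """Return (amount saved, percent off), or None when a price is missing or invalid."""
    if list_price is None or price is None or list_price <= 0:
        return None
    amount = round(list_price - price, 2)
    percent = round(amount / list_price * 100, 1)
    return amount, percent


# Made-up prices, not taken from any real listing.
print(saving(200.0, 150.0))  # prints (50.0, 25.0)
```

Returning `None` for a missing price avoids reporting a bogus discount when a selector matches nothing on a page.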
Tracking Amazon new releases
select
dom_first_text(dom, 'span.zg-item a > div:expr(img=0 && char>10)') as title,
dom_first_text(dom, '.p13n-sc-price') as `price`,
str_substring_between(dom_first_attr(dom, 'span.zg-item div a i.a-icon-star', 'class'), ' a-star-', ' ') as score
from load_and_select('https://www.amazon.com/gp/new-releases/home-garden/ref=zg_bsnr_nav_0', 'ol#zg-ordered-list li.zg-item-immersion')
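Testimonials
What they say...
Yang Jinquan (杨锦全)
General Manager & Partner
With Platon we now collect ten million e-commerce records every day. Against our original budget, hardware costs have been halved and the product development cycle has shortened to three months.
Xu Yuhai (徐玉海)
General Manager
Since we started using Platon to collect overseas news data, the team has been able to focus on what we know best, public opinion analysis, which has greatly improved our team management efficiency.
Qiu Weiming (邱维明)
General Manager & Partner
Platon's Web data management system means our data product ideas can always be realized right away, and customers are often surprised by how quickly we deliver prototypes.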
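Reference Pricing Plans
Choose the plan that suits you best
Our Team
Zhang Bin (张斌)
General Manager & Founder
Yao Yao (姚尧)
Chief Operating Officer
Xu Feilong (许飞龙)
Chief Consultant
Chu Xuezhong (褚雪忠)
Chief Architect
Frequently Asked Questions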
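How does Platon structure web pages automatically?
Platon examines many kinds of web page features, including geometry, topology, code structure, and semantics. It models every DOM element as an attributed rectangle on a manifold and then applies standard machine learning.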
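To make the "attributed rectangle" idea more concrete, here is a toy Python sketch. It is not Platon's model: the `DomRect` class, its fields, and the `group_aligned` heuristic are all assumptions for illustration, and a simple alignment rule stands in for the machine learning step. It reduces each DOM element to a rectangle with a few attributes and groups elements that share depth, left edge, and width, which is typical of repeated list items.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class DomRect:
    """A DOM element reduced to a rectangle plus a few attributes (hypothetical)."""
    selector: str
    left: int
    top: int
    width: int
    height: int
    depth: int        # depth of the element in the DOM tree
    text_length: int  # number of characters of text inside the element

    def features(self) -> Tuple[int, ...]:
        """Numeric feature vector that a learning algorithm could consume."""
        return (self.left, self.top, self.width, self.height, self.depth, self.text_length)


def group_aligned(rects: List[DomRect], tolerance: int = 4) -> List[List[DomRect]]:
    """Group rectangles with equal depth and roughly equal left edge and width.

    Bucketing by integer division is crude: values on opposite sides of a
    bucket boundary are split even when they are close together.
    """
    buckets: Dict[Tuple[int, int, int], List[DomRect]] = defaultdict(list)
    for rect in rects:
        key = (rect.depth, rect.left // tolerance, rect.width // tolerance)
        buckets[key].append(rect)
    # Keep only groups with repeated items, ordered top to bottom.
    return [sorted(group, key=lambda r: r.top) for group in buckets.values() if len(group) > 1]


# Made-up coordinates: one heading and three similar list items.
page = [
    DomRect("h1", 20, 10, 600, 40, 3, 12),
    DomRect("li:nth-child(1)", 20, 60, 300, 80, 5, 90),
    DomRect("li:nth-child(2)", 21, 150, 300, 80, 5, 75),
    DomRect("li:nth-child(3)", 20, 240, 301, 80, 5, 110),
]
for group in group_aligned(page):
    print([rect.selector for rect in group])
```

In a real system the feature vectors would feed a learned model rather than a fixed rule; the sketch only shows what kind of geometric and structural signal such a representation exposes.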
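What languages is Platon written in?
The Platon solution uses several programming languages. The core data engine is written mainly in Kotlin/Java, with small amounts of C++, JavaScript, Bash, HTML, CSS, and others; the core engine has more than 300,000 lines of source code. Companion subprojects use Clojure, ReactJS, and others.
Is Platon open source?
Yes. Both the Platon core engine and the Web BI system are open source.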
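Which programming languages can be used to access the Platon SaaS service?
The Platon solution offers standard SQL support and a REST API, so client code in any programming language can call it easily. In most cases, sending a single REST request is all it takes.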
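As a minimal sketch of such a request from Python, using only the standard library: the endpoint, the POST method, and the text/plain content type are copied from the curl example on this page, while the helper names `build_request` and `execute` are hypothetical. The response format is not documented here, so the sketch returns the body as raw text and makes no assumption about its structure.

```python
import urllib.request

# Endpoint copied from the curl example on this page; it may not be reachable.
ENDPOINT = "http://platonic.fun:8182/api/x/e"


def build_request(sql: str, endpoint: str = ENDPOINT) -> urllib.request.Request:
    """Build a POST request that carries an X-SQL statement as plain text."""
    return urllib.request.Request(
        endpoint,
        data=sql.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


def execute(sql: str, timeout: float = 60.0) -> str:
    """Send the statement and return the raw response body as text."""
    with urllib.request.urlopen(build_request(sql), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Example (not run here, because it needs network access):
# print(execute("select dom_base_uri(dom) as url "
#               "from load_and_select('https://www.amazon.com/dp/B07XJ8C8F7 -i 20s', 'body');"))
```

Keeping request construction separate from sending makes the request easy to inspect or log before any network call is made.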
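Why does Platon support SQL?
We have studied web data processing for many years and want to govern external data in the best way possible. We believe that means treating the Internet on equal terms with a local database. In later versions Platon will support streaming SQL to fully match the streaming nature of web data.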
Contact Us
Platon
Join Platon and start a revolution in enterprise-grade Web data management.
galaxyeye@live.cn
+86 186 2153 8660